problem description

Atherosclerotic Cardiovascular Disease (ASCVD), which encompasses coronary heart disease, cerebrovascular disease, and peripheral arterial disease, is a leading cause of morbidity and mortality worldwide. Current ASCVD risk assessment tools, while useful, have limitations and may not adequately consider the complex, multifactorial nature of the disease. As such, there is a growing need for more sophisticated predictive models that integrate a wider array of clinical and demographic variables to identify individuals at risk for ASCVD more accurately and earlier in the disease process.

Goal of Collecting this Dataset:

Our goal is to predict 10-year ASCVD risk in adults using key features such as age, gender, race, smoking status, diabetes, hypertension, and cholesterol levels. The dataset aims to facilitate accurate risk assessments and guide targeted preventive healthcare interventions.

The objectives of the project include:

  1. Employing decision tree learning algorithms that can uncover intricate patterns and interactions among diverse risk factors.

  2. Enhancing the precision and personalization of ASCVD risk prediction beyond what is possible with conventional risk assessment tools.

Data mining task

To employ an advanced predictive model for ASCVD risk assessment decision tree learning and clustering algorithms that can identify complex patterns in our dataset, capturing interactions among a multitude of risk factors for ASCVD.

Source of the dataset: HeartRisk

It consist of 1000 row that each have 10 attributes.

Class label:

“Risk”; 10-year risk for ASCVD which is categorized as:

Low-risk (<5%)
Borderline risk (5% to 7.4%)
Intermediate risk (7.5% to 19.9%)
High risk (≥20%)

data

Type of attributes:

##         Attribute_Name          Description         Data_Type
## 1               isMale               Gender            Binary
## 2              isBlack                 Race            Binary
## 3             isSmoker       Smoking Status            Binary
## 4           isDiabetic      Diabetes Status            Binary
## 5       isHypertensive  Hypertension Status            Binary
## 6                  Age Age of the candidate Numeric (Integer)
## 7             Systolic   Max Blood Pressure Numeric (Integer)
## 8          Cholesterol    Total Cholesterol Numeric (Integer)
## 9                  HDL      HDL Cholesterol Numeric (Integer)
## 10 Risk  (class label)   10-year ASCVD Risk Numeric (Decimal)
##                             Possible_Values
## 1                      0 (Female), 1 (Male)
## 2                  0 (Not Black), 1 (Black)
## 3                0 (Non-smoker), 1 (Smoker)
## 4                  0 (Normal), 1 (Diabetic)
## 5                0 (Normal BP), 1 (High BP)
## 6                       Range between 40-79
## 7                      Range between 90-200
## 8                     Range between 130-200
## 9                      Range between 20-100
## 10 Low, Borderline, Intermediate, High risk

Table that shows our Dataset before any modifications:

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
dataset <- read.csv("heartRisk.csv")
head(dataset)

The structure of the dataset provides a top-level view of the variables it contains, by understanding the structure of the dataset and the attributes it contains, we can better analyze and interpret the data to gain insights into the relationship between these variables and the 10-year ASCVD risk:

str(dataset)
## 'data.frame':    1000 obs. of  10 variables:
##  $ isMale        : int  1 0 0 1 0 0 1 1 0 1 ...
##  $ isBlack       : int  1 0 1 1 0 0 0 0 0 0 ...
##  $ isSmoker      : int  0 0 1 1 1 1 1 1 1 0 ...
##  $ isDiabetic    : int  1 1 1 1 0 0 0 1 0 1 ...
##  $ isHypertensive: int  1 1 1 0 1 1 0 0 1 1 ...
##  $ Age           : int  49 69 50 42 66 52 40 75 42 65 ...
##  $ Systolic      : int  101 167 181 145 134 154 104 136 169 196 ...
##  $ Cholesterol   : int  181 155 147 166 199 174 187 189 179 187 ...
##  $ HDL           : int  32 59 59 46 63 22 52 59 99 46 ...
##  $ Risk          : num  11.1 30.1 37.6 13.2 15.1 17.3 2.1 46 1.7 48.5 ...

Dataset dimensions:

dim(dataset)
## [1] 1000   10
  • Number of rows = 1000 , Number of columns = 10

To summarize the descriptive statistics for all the columns in the dataset, we can calculate various statistical measures for each attribute:

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(dataset)
## dataset 
## 
##  10  Variables      1000  Observations
## --------------------------------------------------------------------------------
## isMale 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1000        0        2     0.75      490     0.49   0.5003 
## 
## --------------------------------------------------------------------------------
## isBlack 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1000        0        2    0.747      530     0.53   0.4987 
## 
## --------------------------------------------------------------------------------
## isSmoker 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1000        0        2    0.749      516    0.516      0.5 
## 
## --------------------------------------------------------------------------------
## isDiabetic 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1000        0        2    0.749      522    0.522   0.4995 
## 
## --------------------------------------------------------------------------------
## isHypertensive 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1000        0        2     0.75      495    0.495   0.5005 
## 
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       40    0.999    59.11    13.32       42       43 
##      .25      .50      .75      .90      .95 
##       49       59       69       75       77 
## 
## lowest : 40 41 42 43 44, highest: 75 76 77 78 79
## --------------------------------------------------------------------------------
## Systolic 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      111        1    144.2    36.69       95      102 
##      .25      .50      .75      .90      .95 
##      117      144      171      189      194 
## 
## lowest :  90  91  92  93  94, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## Cholesterol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       71        1      164    23.48      133      136 
##      .25      .50      .75      .90      .95 
##      146      164      182      192      196 
## 
## lowest : 130 131 132 133 134, highest: 196 197 198 199 200
## --------------------------------------------------------------------------------
## HDL 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0       81        1     59.6    27.56       23       27 
##      .25      .50      .75      .90      .95 
##       39       59       81       93       97 
## 
## lowest :  20  21  22  23  24, highest:  96  97  98  99 100
## --------------------------------------------------------------------------------
## Risk 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      439        1    19.67    18.37     1.20     2.20 
##      .25      .50      .75      .90      .95 
##     6.30    14.40    29.00    45.13    55.30 
## 
## lowest : 0.1  0.2  0.3  0.4  0.5 , highest: 76.5 76.8 78.1 78.5 85.4
## --------------------------------------------------------------------------------

To have a better understanding of the values in our Dataset, we applied various statistical measures to the attributes. These measures provide insights into different aspects of the data:

summary(dataset)
##      isMale        isBlack        isSmoker       isDiabetic    isHypertensive 
##  Min.   :0.00   Min.   :0.00   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :0.00   Median :1.00   Median :1.000   Median :1.000   Median :0.000  
##  Mean   :0.49   Mean   :0.53   Mean   :0.516   Mean   :0.522   Mean   :0.495  
##  3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :1.00   Max.   :1.00   Max.   :1.000   Max.   :1.000   Max.   :1.000  
##       Age           Systolic      Cholesterol       HDL             Risk      
##  Min.   :40.00   Min.   : 90.0   Min.   :130   Min.   : 20.0   Min.   : 0.10  
##  1st Qu.:49.00   1st Qu.:117.0   1st Qu.:146   1st Qu.: 39.0   1st Qu.: 6.30  
##  Median :59.00   Median :144.0   Median :164   Median : 59.0   Median :14.40  
##  Mean   :59.11   Mean   :144.2   Mean   :164   Mean   : 59.6   Mean   :19.67  
##  3rd Qu.:69.00   3rd Qu.:171.0   3rd Qu.:182   3rd Qu.: 81.0   3rd Qu.:29.00  
##  Max.   :79.00   Max.   :200.0   Max.   :200   Max.   :100.0   Max.   :85.40

We measured the Variance for all numeric attributes to see the degree of spread in the dataset:

var(dataset$Age)
## [1] 133.0906
var(dataset$Systolic)
## [1] 1009.621
var(dataset$Cholesterol)
## [1] 413.3045
var(dataset$HDL)
## [1] 569.4669
var(dataset$Risk)
## [1] 290.4959

All the attributes’ variance results are higher than their mean values, which implies that the dataset has greater variability and is more heterogeneous. This might indicate that the values in our dataset are more scattered; have a wider range of values, potentially suggesting a more diverse or varied pattern in the data.

Scatter Plot:

library(ggplot2)

ggplot(dataset, aes(x = Age, y =Systolic, color= 'red'))+
  geom_point() +
  xlab("Age") +
  ylab("Blood Pressure")

In order to gain a deeper understanding of our dataset, we examined the attributes “Systolic” and “Age” to determine if there was a predictive or correlational relationship between them. However, after analyzing the scatter plot, we discovered that there is no discernible relationship or correlation between these two attributes.

ggplot(dataset, aes(x = Systolic, y = Risk)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, aes(color = "Regression Line")) +
  facet_wrap(~cut(Age, 3), scales = "free") +
  xlab("Systolic Blood Pressure") +
  ylab("Risk") +
  ggtitle("Relationship between Systolic Blood Pressure and Risk at Different Age Levels") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

However, notable association between ‘Systolic Blood Pressure’, ‘Age’, and ‘Risk’, segmented into various age categories. It shows that risk notably rises with age and Blood Pressure,as the regression line for the age bracket (66,79] exhibits higher risks, indicating a high correlation between advancing age and elevated risk levels in this dataset.

Density Plot:

library(tidyr)

dataset_long <- gather(dataset, key = "column", value = "value", Age:ncol(dataset))

ggplot(dataset_long, aes(x = value, fill = column)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~column, scales = "free") +
  xlab("Value") +
  ylab("Density")

To understand the relative frequency of different values within our dataest we measeured the density, and analyzed the corresponding graphs. Here are the observations we made:

- The graph representing the distribution of ages shows a reasonable representation of ages between 40 and 80 within the dataset. This suggests that the age values are well-distributed within this range.

- Both the density graphs for cholesterol and HDL indicate a slight skew towards lower cholesterol levels. This suggests that the majority of the data points tend to have lower cholesterol values rather than higher ones.

- The density graph for systolic blood pressure displays a uniform distribution across the entire range of blood pressures. This indicates that the data points are evenly spread out without any significant concentration in specific pressure ranges.

- The density graph for the risk variable exhibits a positively skewed (right-skewed) distribution. This implies that there is a higher frequency of data points with lower risk values, while the occurrence of higher risk values is relatively less frequent.

Bar Plot visualization for the ‘isSmoker’ attribute:

bb <- dataset$isSmoker %>% table() %>%
barplot(bb , col = c("lightgreen","darkred"), width= c(4,4.1),space=0.1, names.arg=c("o","1"), legend.text = c("Non-Smoker","Smoker"))

To better understand the smoking status within our dataset, we visualized the data using a bar plot. This visualization was chosen to provide a clear and easily interpretable representation of the differences in smoking status. From the bar plot, we observed that the numbers are nearly evenly distributed between non-smokers (0) and smokers (1). This indicates that there is a balanced representation of individuals who are non-smokers and smokers in the dataset.

Matrix measurement of the correlation in our dataset:

library(corrplot)
## corrplot 0.92 loaded
corr_matrix <- cor(dataset)
corrplot(corr_matrix, method = "color", type = "lower", tl.col = "black", tl.srt = 45, 
          addCoef.col = "black", number.cex = 0.7, tl.cex = 0.7, col = colorRampPalette(c("white", "lightblue"))(90))
## Warning in ind1:ind2: numerical expression has 2 elements: only the first used

By analyzing the correlation matrix of our dataset, we can identify suspicious events and patterns in the data. However, it is evident that there are no strong correlations among the features in the dataset. Despite this, we can rank the correlations in descending order based on their impact on the risk of heart disease.The order of correlations, from highest to lowest in terms of their influence on heart disease risk, is as follows: Age, Systolic blood pressure, is Diabetic, is Smoker, is Hypertensive, gender is male, , race is black, Cholestrol, HDL.

Box Plot:

boxplot(dataset$Age)

The Age boxplot shows a wide range of values that might lead to a lower accuracy of the results when it comes to clculations so we need change it to a standardized range. Additionally, the boxplot analysis indicates that there are no outliers present in the Age attribute. This implies that the Age data points are within a reasonable range and do not deviate significantly from the overall distribution of values.

boxplot(dataset$Systolic)

The boxplot analysis of the Systolic blood pressure attribute reveals the absence of outliers, indicating that the data points are within a reasonable range without any extreme values. However, it is worth noting that the range of Systolic blood pressure is considerably large. To ensure accurate calculations and mitigate potential conflicts, it is recommended to transform the Systolic blood pressure into a smaller and standardized range. This transformation will help normalize the data and make it more suitable for analysis and calculations.

boxplot(dataset$Cholesterol)

According to the boxplot analysis of the Cholesterol attribute, no outliers are observed, suggesting that the data points are within a reasonable range without any extreme values. However, it is important to narrow down the range of values to optimize the accuracy of our calculations. By reducing the range of Cholesterol values, we can improve the reliability and precision of our dataset, enabling us to obtain more reliable and meaningful results.

boxplot(dataset$HDL)

The HDL boxplot reveal that there are no outlires shown. However, it is necessary to transform the range of HDL values to bring them into a standardized and common range. By performing this transformation, we can almost ensure to have better insights and improved data quality.


data preprocessing

2-Data cleaning

2.1 Missing values:

Since missing/null values can affect the dataset badly we decided to check it and delete all missing/null values from our dataset to make it as clean as possible so that we can end up with efficint dataset resulting to a higher possibiliaty of accurete results later on.

# Check for missing values
missing_values <- colSums(is.na(dataset))

# Print columns with missing values
print("Columns with missing values:")
## [1] "Columns with missing values:"
print(names(missing_values)[missing_values > 0])
## character(0)
# Print the count of missing values for each column
print("Count of missing values for each column:")
## [1] "Count of missing values for each column:"
print(missing_values)
##         isMale        isBlack       isSmoker     isDiabetic isHypertensive 
##              0              0              0              0              0 
##            Age       Systolic    Cholesterol            HDL           Risk 
##              0              0              0              0              0
The analysis revealed that there are no missing values across any of the attributes.

2.2 Detecting and removing outliers:

In data analysis, checking and removing outliers is crucial to ensure the reliability of statistical insights. Outliers, as extreme data points, can distort summary statistics, potentially leading to inaccurate analyses. By identifying and, if necessary, removing outliers, we enhance the robustness of our findings.

# Compute IRQ
Q1 <- quantile(dataset$Age, 0.25)
Q3 <- quantile(dataset$Age, 0.75)
IQR <- Q3 - Q1

# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Age < lower_bound | dataset$Age > upper_bound)

# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Age outliers:", num_outliers))
## [1] "Number of Age outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$Systolic, 0.25)
Q3 <- quantile(dataset$Systolic, 0.75)
IQR <- Q3 - Q1

# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Systolic < lower_bound | dataset$Systolic > upper_bound)

# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Systolic outliers:", num_outliers))
## [1] "Number of Systolic outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$Cholesterol, 0.25)
Q3 <- quantile(dataset$Cholesterol, 0.75)
IQR <- Q3 - Q1

# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$Cholesterol < lower_bound | dataset$Cholesterol > upper_bound)

# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of Cholesterol outliers:", num_outliers))
## [1] "Number of Cholesterol outliers: 0"
# Compute IRQ
Q1 <- quantile(dataset$HDL, 0.25)
Q3 <- quantile(dataset$HDL, 0.75)
IQR <- Q3 - Q1

# Identify outliers
lower_bound <- Q1 - (1.5 * IQR)
upper_bound <- Q3 + (1.5 * IQR)
outliers <- which(dataset$HDL < lower_bound | dataset$HDL > upper_bound)

# Get the number of outliers
num_outliers <- length(outliers)
print(paste("Number of HDL outliers:", num_outliers))
## [1] "Number of HDL outliers: 0"

The result indicates that there are no outliers, but we will also use a box plot to ensure that there are no outliers.

boxplot(dataset[,c(6,7,8,9)], main="Boxplot with Outliers", col=c("lightblue","lightblue","lightblue","lightblue"))

By using the box plot we can see that there are no outliers in the data set.


3-Data reduction

In analyzing the dataset,The initial dataset provided a comprehensive and relevant set of information for the research objectives without the need for removal or condensation of variables.

used the findCorrelation function in caret library to outputs the index of variables we need to delete. targeting any pair with a correlation coefficient exceeding 0.75.

findCorrelation(cor(dataset), cutoff=0.75)
## integer(0)

In our case, the function finds out that no feature need to be deleted.


4-Data transformation

4.1 normalization

Data normalization is a preprocessing step that involves transforming numerical data within a dataset to a standard, uniform scale. This process ensures that all variables, regardless of their original units or scales, are brought into a consistent and comparable range. the following attributes were selected for normalization:(age, systolic, cholestrol, HDL)

normalize <- function(x)
{
  return ((x - min(x))/ (max(x)- min(x)) )
}

dataset$Age<-normalize(dataset$Age)
dataset$Systolic<-normalize(dataset$Systolic)
dataset$Cholesterol<-normalize(dataset$Cholesterol)
dataset$HDL<-normalize(dataset$HDL)

head(dataset)

we have successfully completed the data normalization. This process entailed scaling our numerical features to a standardized range, typically between 0 and 1.

4.2 Discretization

To make our dataset understandable and easily interpretable, especially when using tree-based classification methods, we transformed the continuous class label ‘Risk’ into specific, categorized risk levels.

These levels are delineated as:

Low risk (<5%), Borderline risk (5% to 7.4%), Intermediate risk (7.5% to 19.9%), and High risk (≥20%).

# Categorize 'Risk' into defined categories
dataset$Risk <- cut(
  dataset$Risk, 
  breaks = c(-Inf, 5, 7.4, 19.9, Inf),
  labels = c("Low risk", "Borderline risk", "Intermediate risk", "High risk"),
  right = FALSE,
  include.lowest = TRUE
)

our dataset after Discretization:

head(dataset)

5- Feature selection

Feature selection is one of the most important task to boost performance of our machine learning model by removing irrelevant features the model will make decisions only using important features. we will use Recursive Feature Elimination (RFE), which is a widely used wrapper-type algorithm for selecting features that are most relevant in predicting the target variable ‘Risk’ in our case.

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-2
# ensure results are repeatable
set.seed(7)

# Define RFE control parameters
ctrl <- rfeControl(functions=rfFuncs, method="cv", number=10)

# Execute RFE using dataset features 1-9 and "Risk" as the class lable
results <- rfe(dataset[,1:9], dataset$Risk, sizes=c(1:9), rfeControl=ctrl)

# Display RFE results
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.5832 0.3887    0.03859 0.05413         
##          2   0.5489 0.3401    0.03516 0.05332         
##          3   0.6230 0.4335    0.03123 0.04525         
##          4   0.6671 0.5073    0.04478 0.06397         
##          5   0.6770 0.5222    0.02512 0.03598         
##          6   0.7132 0.5739    0.03336 0.05041         
##          7   0.7821 0.6764    0.03986 0.05887         
##          8   0.7812 0.6748    0.03076 0.04539         
##          9   0.8009 0.7051    0.02630 0.03865        *
## 
## The top 5 variables (out of 9):
##    Age, Systolic, isDiabetic, isSmoker, isMale
plot(results, type=c("g", "o"))

The asterisk (*) in the column indicates the number of features recommended by RFE as yielding the best model according to the resampling results. it shows that when 9 variables are used, the model achieves the best accuracy of approximately 80% and a kappa value of 0.7.

The graphical representation of feature importance :

The “Mean Decrease Gini” score tells us how crucial a feature is for making accurate predictions in a Random Forest model. A higher score means the feature is more valuable in deciding how to classify the data correctly, helping the model make better decisions.

# Setting seed for reproducibility
set.seed(123)

# Fit a random forest model
rf_model <- randomForest(Risk ~ ., data = dataset)
var_imp <- importance(rf_model)
var_imp_df <- data.frame(variables = row.names(var_imp), var_imp)

# Sorting variables based on importance
var_imp_df <- var_imp_df[order(var_imp_df$MeanDecreaseGini, decreasing = TRUE),]

# Plotting variable importance using ggplot2
ggplot(var_imp_df, aes(x = reorder(variables, MeanDecreaseGini), y = MeanDecreaseGini)) +geom_col() +
  coord_flip() +
  labs(title = "Feature Importance",
       x = "Features",
      y = "Importance (Mean Decrease in Gini)")

The graph shows that ‘Age’ and ‘Systolic’ are key variables influencing our model’s predictions of ‘Risk’. while variables like isHypertensive, isBlack were found to have the least impact on the model’s predictive capability.

Overall, we think it’s a good practice to make use of all our features as recommended by RFE, particularly when we are dealing with a modest number, to avoid potential overfitting.we


phase-3

balancing data

Balancing data is crucial for improving the performance and fairness of machine learning models. When data are imbalanced, with one class significantly outnumbering the others, models tend to bias towards the majority class, leading to poor predictive accuracy for minority classes.

Before balancing our data:

# Calculate class distribution
class_distribution <- table(dataset$Risk)
# Create a bar plot
barplot(class_distribution, 
        main = "Class Distribution for Risk",
        xlab = "Risk Level",
        ylab = "Count",
        names.arg = levels(dataset$Risk))

After balancing our data:

library(ROSE)
## Loaded ROSE 0.0-4
balanced_data <- upSample(dataset[, 1:9], dataset$Risk, yname = "Risk")
# Plot the distribution of the "Risk" classes
plot(balanced_data$Risk)

# Check the proportion and count of "Risk" classes
prop_table <- prop.table(table(balanced_data$Risk))
count_table <- table(balanced_data$Risk)

After balancing our data, the model becomes more capable of providing accurate predictions, ensuring a fair evaluation of their performance.


Data Mining Techniques and analysis

6- Classification

Classification analysis is a fundamental aspect of machine learning, focusing on categorizing data into distinct classes. In our study, we aim to build predictive models that efficiently assign predefined labels to new instances based on their features. To enhance the robustness of our models, we have divided the dataset into three sets: training, validation, and testing. By employing different proportions of training data—60%, 70%, and 80%—we seek to evaluate and compare the models’ performances. This approach ensures a comprehensive understanding of model behavior under varying training scenarios, guiding us to select the most effective model for our specific dataset.

-Decision tree using Gain ratio (C4.5):

Gain ratio is a metric that assesses the quality of a split within decision tree algorithms. to evaluate the quality of a split based on the information gain and the intrinsic information of a feature.we have implemented the Gain Ratio (C4.5) algorithm and the J48 function from the RWeka package. This algorithm partitions our data into training and testing sets, builds a J48 decision tree on the training data,

1-partition the data into ( 60% training, 40% testing):

# Load the RWeka package
library(RWeka)
set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]

# Define the formula
myFormula <- Risk ~ .

# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData)

# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##                    
##                     Low risk Borderline risk Intermediate risk High risk
##   Low risk               240               1                 3         1
##   Borderline risk          6             217                 5         1
##   Intermediate risk        0               0               225        13
##   High risk                0               3                17       227
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
## 
## Age <= 0.564103
## |   HDL <= 0.225
## |   |   Systolic <= 0.545455
## |   |   |   isHypertensive <= 0
## |   |   |   |   Age <= 0.025641: Low risk (6.0)
## |   |   |   |   Age > 0.025641
## |   |   |   |   |   HDL <= 0.0125: Intermediate risk (6.0)
## |   |   |   |   |   HDL > 0.0125
## |   |   |   |   |   |   Cholesterol <= 0.2
## |   |   |   |   |   |   |   Systolic <= 0.290909: Low risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.290909: Intermediate risk (3.0)
## |   |   |   |   |   |   Cholesterol > 0.2
## |   |   |   |   |   |   |   Age <= 0.128205
## |   |   |   |   |   |   |   |   isBlack <= 0: Borderline risk (2.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.051282: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   Age > 0.051282: Low risk (2.0)
## |   |   |   |   |   |   |   Age > 0.128205
## |   |   |   |   |   |   |   |   Age <= 0.435897: Borderline risk (18.0)
## |   |   |   |   |   |   |   |   Age > 0.435897
## |   |   |   |   |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   |   |   |   |   Age <= 0.461538: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   |   Age > 0.461538
## |   |   |   |   |   |   |   |   |   |   |   Systolic <= 0.190909: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   Systolic > 0.190909: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   isDiabetic > 0: Intermediate risk (2.0)
## |   |   |   isHypertensive > 0
## |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   Systolic <= 0.190909
## |   |   |   |   |   |   |   isMale <= 0: Low risk (6.0)
## |   |   |   |   |   |   |   isMale > 0: Intermediate risk (3.0)
## |   |   |   |   |   |   Systolic > 0.190909: Borderline risk (5.0/1.0)
## |   |   |   |   |   Systolic > 0.309091: Intermediate risk (10.0)
## |   |   |   |   isDiabetic > 0
## |   |   |   |   |   isMale <= 0: Intermediate risk (11.0/1.0)
## |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   isBlack <= 0: High risk (3.0/1.0)
## |   |   |   |   |   |   isBlack > 0: Intermediate risk (4.0/1.0)
## |   |   Systolic > 0.545455
## |   |   |   Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## |   |   |   Cholesterol > 0.014286
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   Systolic <= 0.681818: High risk (3.0)
## |   |   |   |   |   Systolic > 0.681818
## |   |   |   |   |   |   Age <= 0.461538: Intermediate risk (9.0)
## |   |   |   |   |   |   Age > 0.461538: High risk (3.0/1.0)
## |   |   |   |   isSmoker > 0: High risk (29.0/6.0)
## |   HDL > 0.225
## |   |   Age <= 0.282051
## |   |   |   isBlack <= 0
## |   |   |   |   Cholesterol <= 0.557143
## |   |   |   |   |   Systolic <= 0.718182: Low risk (78.0)
## |   |   |   |   |   Systolic > 0.718182
## |   |   |   |   |   |   isDiabetic <= 0: Low risk (18.0/1.0)
## |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   HDL <= 0.6375
## |   |   |   |   |   |   |   |   Systolic <= 0.909091: Low risk (5.0)
## |   |   |   |   |   |   |   |   Systolic > 0.909091: Intermediate risk (2.0)
## |   |   |   |   |   |   |   HDL > 0.6375: Borderline risk (11.0)
## |   |   |   |   Cholesterol > 0.557143
## |   |   |   |   |   Systolic <= 0.163636: Low risk (5.0)
## |   |   |   |   |   Systolic > 0.163636
## |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   Age <= 0.230769: Low risk (15.0)
## |   |   |   |   |   |   |   Age > 0.230769: Borderline risk (8.0/1.0)
## |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   HDL <= 0.7375: Borderline risk (33.0/4.0)
## |   |   |   |   |   |   |   HDL > 0.7375: Low risk (8.0/1.0)
## |   |   |   isBlack > 0
## |   |   |   |   Systolic <= 0.536364
## |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   Cholesterol <= 0.828571: Low risk (30.0/1.0)
## |   |   |   |   |   |   Cholesterol > 0.828571: Borderline risk (2.0)
## |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Low risk (9.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0: Borderline risk (6.0/1.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Age <= 0.179487: Borderline risk (12.0/1.0)
## |   |   |   |   |   |   |   |   Age > 0.179487: Intermediate risk (2.0)
## |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   Systolic <= 0.072727: Low risk (4.0)
## |   |   |   |   |   |   |   Systolic > 0.072727: Intermediate risk (9.0/1.0)
## |   |   |   |   Systolic > 0.536364
## |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   Age <= 0.205128
## |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   Age <= 0.128205: Borderline risk (5.0)
## |   |   |   |   |   |   |   |   Age > 0.128205: Low risk (6.0/1.0)
## |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.685714: Intermediate risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.685714: Borderline risk (8.0)
## |   |   |   |   |   |   Age > 0.205128
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   Age <= 0.25641: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.25641: Low risk (2.0)
## |   |   |   |   |   |   |   isSmoker > 0: Intermediate risk (4.0)
## |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   Systolic <= 0.890909
## |   |   |   |   |   |   |   Age <= 0.076923
## |   |   |   |   |   |   |   |   HDL <= 0.5625: Intermediate risk (3.0)
## |   |   |   |   |   |   |   |   HDL > 0.5625: Borderline risk (7.0)
## |   |   |   |   |   |   |   Age > 0.076923
## |   |   |   |   |   |   |   |   Age <= 0.179487: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   Age > 0.179487
## |   |   |   |   |   |   |   |   |   isDiabetic <= 0: Intermediate risk (4.0/1.0)
## |   |   |   |   |   |   |   |   |   isDiabetic > 0: High risk (2.0)
## |   |   |   |   |   |   Systolic > 0.890909: High risk (7.0)
## |   |   Age > 0.282051
## |   |   |   Systolic <= 0.7
## |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   Age <= 0.487179
## |   |   |   |   |   |   |   Systolic <= 0.381818: Low risk (19.0)
## |   |   |   |   |   |   |   Systolic > 0.381818
## |   |   |   |   |   |   |   |   HDL <= 0.55: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   HDL > 0.55: Low risk (10.0/1.0)
## |   |   |   |   |   |   Age > 0.487179
## |   |   |   |   |   |   |   Cholesterol <= 0.3: Low risk (3.0)
## |   |   |   |   |   |   |   Cholesterol > 0.3
## |   |   |   |   |   |   |   |   Systolic <= 0.363636: Borderline risk (13.0)
## |   |   |   |   |   |   |   |   Systolic > 0.363636
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.414286: Borderline risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.414286: Intermediate risk (2.0)
## |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   Systolic <= 0.663636
## |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   isSmoker <= 0: Low risk (5.0)
## |   |   |   |   |   |   |   |   isSmoker > 0: Intermediate risk (2.0)
## |   |   |   |   |   |   |   isHypertensive > 0: Intermediate risk (18.0)
## |   |   |   |   |   |   Systolic > 0.663636: Borderline risk (7.0)
## |   |   |   |   isDiabetic > 0
## |   |   |   |   |   Age <= 0.461538
## |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   |   HDL <= 0.675: Borderline risk (8.0/1.0)
## |   |   |   |   |   |   |   |   |   HDL > 0.675: Low risk (4.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Systolic <= 0.290909: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.290909: Intermediate risk (2.0)
## |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   Systolic <= 0.072727: Borderline risk (12.0)
## |   |   |   |   |   |   |   |   Systolic > 0.072727
## |   |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   |   isHypertensive > 0: Borderline risk (4.0)
## |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   Systolic <= 0.6
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.628571: Borderline risk (13.0/1.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.628571: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   Systolic > 0.6: High risk (2.0)
## |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   isMale <= 0: Intermediate risk (3.0/1.0)
## |   |   |   |   |   |   |   |   isMale > 0: High risk (2.0)
## |   |   |   |   |   Age > 0.461538
## |   |   |   |   |   |   Cholesterol <= 0.328571: Borderline risk (2.0)
## |   |   |   |   |   |   Cholesterol > 0.328571: Intermediate risk (19.0/1.0)
## |   |   |   Systolic > 0.7
## |   |   |   |   Systolic <= 0.9
## |   |   |   |   |   isSmoker <= 0: Intermediate risk (12.0)
## |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   Age <= 0.384615: Intermediate risk (6.0)
## |   |   |   |   |   |   Age > 0.384615: High risk (7.0/1.0)
## |   |   |   |   Systolic > 0.9
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Systolic <= 0.936364: Borderline risk (7.0)
## |   |   |   |   |   |   Systolic > 0.936364: Intermediate risk (5.0/1.0)
## |   |   |   |   |   isDiabetic > 0: High risk (4.0)
## Age > 0.564103
## |   Systolic <= 0.5
## |   |   isDiabetic <= 0
## |   |   |   HDL <= 0.15
## |   |   |   |   Systolic <= 0.190909
## |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   isMale > 0: Intermediate risk (2.0)
## |   |   |   |   Systolic > 0.190909: High risk (9.0)
## |   |   |   HDL > 0.15
## |   |   |   |   Systolic <= 0.427273
## |   |   |   |   |   Cholesterol <= 0.7
## |   |   |   |   |   |   Systolic <= 0.290909
## |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   HDL <= 0.6: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   HDL > 0.6
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.371429: Intermediate risk (3.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.371429
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.054545: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.054545: Borderline risk (18.0)
## |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   Systolic <= 0.172727
## |   |   |   |   |   |   |   |   |   Age <= 0.692308: Low risk (3.0)
## |   |   |   |   |   |   |   |   |   Age > 0.692308: Intermediate risk (3.0)
## |   |   |   |   |   |   |   |   Systolic > 0.172727: Borderline risk (7.0/1.0)
## |   |   |   |   |   |   Systolic > 0.290909: Intermediate risk (15.0/1.0)
## |   |   |   |   |   Cholesterol > 0.7
## |   |   |   |   |   |   Age <= 0.897436: Intermediate risk (12.0)
## |   |   |   |   |   |   Age > 0.897436
## |   |   |   |   |   |   |   Systolic <= 0.209091: Intermediate risk (3.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.209091: High risk (3.0)
## |   |   |   |   Systolic > 0.427273
## |   |   |   |   |   Systolic <= 0.472727: High risk (5.0)
## |   |   |   |   |   Systolic > 0.472727: Borderline risk (5.0)
## |   |   isDiabetic > 0
## |   |   |   isSmoker <= 0
## |   |   |   |   Age <= 0.923077
## |   |   |   |   |   Systolic <= 0.318182: Intermediate risk (21.0/3.0)
## |   |   |   |   |   Systolic > 0.318182: High risk (8.0/1.0)
## |   |   |   |   Age > 0.923077: High risk (5.0)
## |   |   |   isSmoker > 0
## |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Age <= 0.794872: Intermediate risk (4.0)
## |   |   |   |   |   |   Age > 0.794872: High risk (2.0)
## |   |   |   |   |   isBlack > 0: High risk (3.0)
## |   |   |   |   isHypertensive > 0: High risk (22.0)
## |   Systolic > 0.5: High risk (128.0/10.0)
## 
## Number of Leaves  :  110
## 
## Size of the tree :   219
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)

# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)

# Display the confusion matrix
print(conf_matrix)
##                    
## testPred            Low risk Borderline risk Intermediate risk High risk
##   Low risk               126               4                 5         1
##   Borderline risk         16             165                24        12
##   Intermediate risk        6               0                87        27
##   High risk                3               7                31       115
# Calculate performance metrics
accuracy_G1 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G1 <-( 1 - accuracy_G1)
sensitivity_G1 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G1 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G1 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])


# Display performance metrics
cat("Accuracy: ", accuracy_G1, "\n")
## Accuracy:  0.7837838
cat("Error Rate: ", error_rate_G1, "\n")
## Error Rate:  0.2162162
cat("Sensitivity (Recall): ", sensitivity_G1, "\n")
## Sensitivity (Recall):  0.7371795
cat("Specificity: ", specificity_G1, "\n")
## Specificity:  0.7991543
cat("Precision: ", precision_G1, "\n")
## Precision:  0.7419355

Analysis:

- The C4.5 decision tree, employing the gain ratio criterion, showcases robust performance on our dataset with an accuracy of 78.38%. Its ability to effectively capture complex relationships is reflected in the tree’s structure, consisting of 219 nodes and 110 leaves. Notably, the model demonstrates a balanced trade-off between sensitivity (73.72%) and specificity (79.92%), indicating its proficiency in correctly identifying positive and negative instances. With a precision of 74.19%, the model reliably makes accurate positive predictions.

The decision tree’s 110 leaves and size of 219 represent the complexity and granularity with which the model classifies ASCVD risk

  • The root of the tree was identified by the attribute Age, with a threshold value of 0.564103, suggesting that age is a primary factor in determining ASCVD risk.

  • Individuals with Age less than or equal to the threshold were further analyzed for HDL cholesterol levels. An HDL level at or below 0.225 indicated a potential for increased risk, with further distinctions made based on Systolic blood pressure readings.

2-partition the data into ( 70% training, 30% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]

  # Define the formula
myFormula <- Risk ~ .

# Build the J48 decision tree on the training data
C45Fit <- J48(myFormula, data = trainData )

# Create a table to compare predicted vs. actual values on the training data
table(predict(C45Fit), trainData$Risk)
##                    
##                     Low risk Borderline risk Intermediate risk High risk
##   Low risk               272               1                 5         1
##   Borderline risk          4             270                 6         4
##   Intermediate risk        5               0               265        12
##   High risk                0               0                14       273
# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
## 
## Age <= 0.564103
## |   HDL <= 0.225
## |   |   Systolic <= 0.545455
## |   |   |   isDiabetic <= 0
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   Age <= 0.25641: Low risk (13.0)
## |   |   |   |   |   Age > 0.25641
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Cholesterol <= 0.257143: Intermediate risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Systolic <= 0.081818: Low risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.081818
## |   |   |   |   |   |   |   |   Age <= 0.538462: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   Age > 0.538462: Borderline risk (2.0)
## |   |   |   |   isSmoker > 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Systolic <= 0.309091: Borderline risk (9.0/1.0)
## |   |   |   |   |   |   Systolic > 0.309091: Intermediate risk (2.0)
## |   |   |   |   |   isBlack > 0: Intermediate risk (13.0/1.0)
## |   |   |   isDiabetic > 0
## |   |   |   |   Age <= 0.410256
## |   |   |   |   |   Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## |   |   |   |   |   Cholesterol > 0.357143
## |   |   |   |   |   |   isHypertensive <= 0: Borderline risk (17.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Cholesterol <= 0.685714
## |   |   |   |   |   |   |   |   isSmoker <= 0: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   isSmoker > 0: High risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.685714: Intermediate risk (4.0)
## |   |   |   |   Age > 0.410256
## |   |   |   |   |   Cholesterol <= 0.271429: Low risk (3.0/1.0)
## |   |   |   |   |   Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## |   |   Systolic > 0.545455
## |   |   |   Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## |   |   |   Cholesterol > 0.014286
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   isDiabetic <= 0: Intermediate risk (12.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   isMale <= 0: Intermediate risk (4.0/1.0)
## |   |   |   |   |   |   isMale > 0: High risk (5.0)
## |   |   |   |   isSmoker > 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Cholesterol <= 0.242857: Intermediate risk (2.0)
## |   |   |   |   |   |   Cholesterol > 0.242857: High risk (13.0/2.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   HDL <= 0.2: High risk (17.0)
## |   |   |   |   |   |   HDL > 0.2: Intermediate risk (3.0/1.0)
## |   HDL > 0.225
## |   |   Age <= 0.282051
## |   |   |   Systolic <= 0.163636
## |   |   |   |   isBlack <= 0: Low risk (44.0)
## |   |   |   |   isBlack > 0
## |   |   |   |   |   isMale <= 0: Low risk (9.0)
## |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   Systolic <= 0.090909: Low risk (6.0)
## |   |   |   |   |   |   Systolic > 0.090909: Intermediate risk (4.0)
## |   |   |   Systolic > 0.163636
## |   |   |   |   isBlack <= 0
## |   |   |   |   |   Cholesterol <= 0.242857: Low risk (38.0/1.0)
## |   |   |   |   |   Cholesterol > 0.242857
## |   |   |   |   |   |   HDL <= 0.8125
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   Age <= 0.230769: Low risk (31.0)
## |   |   |   |   |   |   |   |   Age > 0.230769
## |   |   |   |   |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   isMale > 0: Borderline risk (12.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   |   |   |   Systolic <= 0.218182: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.218182: Low risk (9.0)
## |   |   |   |   |   |   |   |   Systolic > 0.309091
## |   |   |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   |   |   isHypertensive <= 0: Borderline risk (17.0/1.0)
## |   |   |   |   |   |   |   |   |   |   isHypertensive > 0: Low risk (4.0)
## |   |   |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.9
## |   |   |   |   |   |   |   |   |   |   |   HDL <= 0.4625
## |   |   |   |   |   |   |   |   |   |   |   |   isDiabetic <= 0: Borderline risk (8.0/1.0)
## |   |   |   |   |   |   |   |   |   |   |   |   isDiabetic > 0: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   HDL > 0.4625: Borderline risk (23.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.9: Intermediate risk (2.0)
## |   |   |   |   |   |   HDL > 0.8125
## |   |   |   |   |   |   |   isMale <= 0: Low risk (17.0)
## |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   Age <= 0.076923: Low risk (3.0)
## |   |   |   |   |   |   |   |   Age > 0.076923: Intermediate risk (2.0)
## |   |   |   |   isBlack > 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Systolic <= 0.554545
## |   |   |   |   |   |   |   Systolic <= 0.245455
## |   |   |   |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   |   |   |   isMale > 0: Borderline risk (15.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.245455
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Low risk (20.0/2.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   HDL <= 0.4625: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   |   HDL > 0.4625: Low risk (5.0)
## |   |   |   |   |   |   Systolic > 0.554545
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   |   Age <= 0.153846: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   |   Age > 0.153846: Low risk (6.0)
## |   |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.7
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.718182: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.718182: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.7: Borderline risk (10.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Age <= 0: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Age > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.071429: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.071429
## |   |   |   |   |   |   |   |   |   |   Cholesterol <= 0.871429: Intermediate risk (12.0)
## |   |   |   |   |   |   |   |   |   |   Cholesterol > 0.871429: High risk (3.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   |   HDL <= 0.4375
## |   |   |   |   |   |   |   |   Age <= 0.205128: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.205128: Borderline risk (3.0)
## |   |   |   |   |   |   |   HDL > 0.4375: Low risk (6.0)
## |   |   |   |   |   |   Systolic > 0.309091
## |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   Age <= 0.153846
## |   |   |   |   |   |   |   |   |   Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.881818: High risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.153846: High risk (6.0)
## |   |   Age > 0.282051
## |   |   |   Systolic <= 0.254545
## |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   isHypertensive <= 0: Low risk (20.0)
## |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   Cholesterol <= 0.385714: Low risk (5.0)
## |   |   |   |   |   |   |   Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## |   |   |   |   |   |   isMale > 0: Intermediate risk (6.0/1.0)
## |   |   |   |   isDiabetic > 0
## |   |   |   |   |   Age <= 0.435897
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Systolic <= 0.2: Borderline risk (21.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.2: Low risk (3.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0: Low risk (3.0)
## |   |   |   |   |   Age > 0.435897: Intermediate risk (10.0)
## |   |   |   Systolic > 0.254545
## |   |   |   |   isMale <= 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Age <= 0.384615
## |   |   |   |   |   |   |   HDL <= 0.5125: Intermediate risk (2.0)
## |   |   |   |   |   |   |   HDL > 0.5125: Low risk (12.0)
## |   |   |   |   |   |   Age > 0.384615
## |   |   |   |   |   |   |   Cholesterol <= 0.814286
## |   |   |   |   |   |   |   |   Systolic <= 0.936364
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.414286
## |   |   |   |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   |   |   |   Age <= 0.512821: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   Age > 0.512821: Borderline risk (8.0)
## |   |   |   |   |   |   |   |   |   |   isSmoker > 0: Borderline risk (7.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.414286
## |   |   |   |   |   |   |   |   |   |   Age <= 0.512821: Borderline risk (5.0)
## |   |   |   |   |   |   |   |   |   |   Age > 0.512821: Intermediate risk (3.0)
## |   |   |   |   |   |   |   |   Systolic > 0.936364: Intermediate risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.814286: Low risk (3.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Systolic <= 0.609091
## |   |   |   |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   |   |   |   Age <= 0.333333: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Age > 0.333333: Borderline risk (13.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   isSmoker <= 0: Borderline risk (4.0)
## |   |   |   |   |   |   |   |   |   isSmoker > 0: Intermediate risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.609091
## |   |   |   |   |   |   |   |   isBlack <= 0: Intermediate risk (6.0/1.0)
## |   |   |   |   |   |   |   |   isBlack > 0: High risk (2.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Systolic <= 0.827273
## |   |   |   |   |   |   |   |   isBlack <= 0: Intermediate risk (9.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.814286: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.814286: High risk (2.0)
## |   |   |   |   |   |   |   Systolic > 0.827273: High risk (3.0)
## |   |   |   |   isMale > 0
## |   |   |   |   |   Cholesterol <= 0.914286
## |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   isDiabetic <= 0: Intermediate risk (18.0)
## |   |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (6.0/1.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.435897: Borderline risk (4.0)
## |   |   |   |   |   |   |   |   |   Age > 0.435897: Intermediate risk (2.0)
## |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Systolic <= 0.690909: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.690909: High risk (4.0)
## |   |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.128571: Intermediate risk (4.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.128571: High risk (11.0/1.0)
## |   |   |   |   |   Cholesterol > 0.914286
## |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (2.0)
## |   |   |   |   |   |   isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## |   Systolic <= 0.5
## |   |   isDiabetic <= 0
## |   |   |   HDL <= 0.15
## |   |   |   |   Systolic <= 0.190909
## |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   isMale > 0: Intermediate risk (2.0)
## |   |   |   |   Systolic > 0.190909: High risk (9.0)
## |   |   |   HDL > 0.15
## |   |   |   |   Age <= 0.692308
## |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   Systolic <= 0.172727: Low risk (4.0/1.0)
## |   |   |   |   |   |   Systolic > 0.172727
## |   |   |   |   |   |   |   Age <= 0.589744: Intermediate risk (3.0/1.0)
## |   |   |   |   |   |   |   Age > 0.589744: Borderline risk (22.0)
## |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   Age <= 0.589744: Borderline risk (3.0)
## |   |   |   |   |   |   Age > 0.589744: Intermediate risk (9.0/1.0)
## |   |   |   |   Age > 0.692308
## |   |   |   |   |   HDL <= 0.975
## |   |   |   |   |   |   Systolic <= 0.427273
## |   |   |   |   |   |   |   Cholesterol <= 0.057143
## |   |   |   |   |   |   |   |   Cholesterol <= 0.028571: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.028571: Borderline risk (4.0)
## |   |   |   |   |   |   |   Cholesterol > 0.057143
## |   |   |   |   |   |   |   |   isSmoker <= 0: Intermediate risk (25.0/2.0)
## |   |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.769231: Intermediate risk (4.0)
## |   |   |   |   |   |   |   |   |   Age > 0.769231
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.072727: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.072727: High risk (4.0)
## |   |   |   |   |   |   Systolic > 0.427273: High risk (5.0)
## |   |   |   |   |   HDL > 0.975: Borderline risk (5.0)
## |   |   isDiabetic > 0
## |   |   |   isSmoker <= 0
## |   |   |   |   Systolic <= 0.318182
## |   |   |   |   |   Age <= 0.820513: Intermediate risk (18.0/1.0)
## |   |   |   |   |   Age > 0.820513
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Age <= 0.948718: Intermediate risk (4.0)
## |   |   |   |   |   |   |   Age > 0.948718: High risk (4.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0: High risk (3.0)
## |   |   |   |   Systolic > 0.318182: High risk (10.0/1.0)
## |   |   |   isSmoker > 0
## |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Age <= 0.794872: Intermediate risk (4.0)
## |   |   |   |   |   |   Age > 0.794872: High risk (2.0)
## |   |   |   |   |   isBlack > 0: High risk (4.0)
## |   |   |   |   isHypertensive > 0: High risk (28.0)
## |   Systolic > 0.5
## |   |   Age <= 0.589744
## |   |   |   isDiabetic <= 0: Borderline risk (4.0/1.0)
## |   |   |   isDiabetic > 0: High risk (7.0/1.0)
## |   |   Age > 0.589744: High risk (141.0/7.0)
## 
## Number of Leaves  :  131
## 
## Size of the tree :   261
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)


# Display the confusion matrix
print(conf_matrix)
##                    
## testPred            Low risk Borderline risk Intermediate risk High risk
##   Low risk               126               4                 5         1
##   Borderline risk         16             165                24        12
##   Intermediate risk        6               0                87        27
##   High risk                3               7                31       115
# Calculate performance metrics
accuracy_G2 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G2 <-( 1 - accuracy_G2)
sensitivity_G2 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G2 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G2 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])


# Display performance metrics
cat("Accuracy: ", accuracy_G2, "\n")
## Accuracy:  0.7837838
cat("Error Rate: ", error_rate_G2, "\n")
## Error Rate:  0.2162162
cat("Sensitivity (Recall): ", sensitivity_G2, "\n")
## Sensitivity (Recall):  0.7371795
cat("Specificity: ", specificity_G2, "\n")
## Specificity:  0.7991543
cat("Precision: ", precision_G2, "\n")
## Precision:  0.7419355

Analysis:

The C4.5 decision tree, employing the gain ratio criterion, exhibits strong predictive accuracy with an impressive 78.39%. Characterized by 261 nodes and 131 leaves, the tree’s depth allows it to capture intricate patterns within the data. Notably, the model strikes a balance between sensitivity (73%) and specificity (79.78%), showcasing its ability to effectively identify positive and negative instances. With a precision of 74.1%, the model demonstrates accuracy in positive predictions.

  • Age is a critical initial factor. Individuals younger than the threshold value are further assessed for HDL and systolic blood pressure levels.with a threshold of 0.564103.

  • DL cholesterol and Systolic blood pressure are critical secondary predictors, stratifying patients into risk categories from low to high.

  • The highest risk category is assigned to older individuals (Age > 0.589744) with high Systolic blood pressure, indicating that age and blood pressure are critical factors in predicting high ASCVD risk.

3-partition the data into ( 80% training, 20% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]

 # Define the formula
myFormula <- Risk ~ .

# Print a summary of the J48 model
print(C45Fit)
## J48 pruned tree
## ------------------
## 
## Age <= 0.564103
## |   HDL <= 0.225
## |   |   Systolic <= 0.545455
## |   |   |   isDiabetic <= 0
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   Age <= 0.25641: Low risk (13.0)
## |   |   |   |   |   Age > 0.25641
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Cholesterol <= 0.257143: Intermediate risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Systolic <= 0.081818: Low risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.081818
## |   |   |   |   |   |   |   |   Age <= 0.538462: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   Age > 0.538462: Borderline risk (2.0)
## |   |   |   |   isSmoker > 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Systolic <= 0.309091: Borderline risk (9.0/1.0)
## |   |   |   |   |   |   Systolic > 0.309091: Intermediate risk (2.0)
## |   |   |   |   |   isBlack > 0: Intermediate risk (13.0/1.0)
## |   |   |   isDiabetic > 0
## |   |   |   |   Age <= 0.410256
## |   |   |   |   |   Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## |   |   |   |   |   Cholesterol > 0.357143
## |   |   |   |   |   |   isHypertensive <= 0: Borderline risk (17.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Cholesterol <= 0.685714
## |   |   |   |   |   |   |   |   isSmoker <= 0: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   isSmoker > 0: High risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.685714: Intermediate risk (4.0)
## |   |   |   |   Age > 0.410256
## |   |   |   |   |   Cholesterol <= 0.271429: Low risk (3.0/1.0)
## |   |   |   |   |   Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## |   |   Systolic > 0.545455
## |   |   |   Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## |   |   |   Cholesterol > 0.014286
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   isDiabetic <= 0: Intermediate risk (12.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   isMale <= 0: Intermediate risk (4.0/1.0)
## |   |   |   |   |   |   isMale > 0: High risk (5.0)
## |   |   |   |   isSmoker > 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Cholesterol <= 0.242857: Intermediate risk (2.0)
## |   |   |   |   |   |   Cholesterol > 0.242857: High risk (13.0/2.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   HDL <= 0.2: High risk (17.0)
## |   |   |   |   |   |   HDL > 0.2: Intermediate risk (3.0/1.0)
## |   HDL > 0.225
## |   |   Age <= 0.282051
## |   |   |   Systolic <= 0.163636
## |   |   |   |   isBlack <= 0: Low risk (44.0)
## |   |   |   |   isBlack > 0
## |   |   |   |   |   isMale <= 0: Low risk (9.0)
## |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   Systolic <= 0.090909: Low risk (6.0)
## |   |   |   |   |   |   Systolic > 0.090909: Intermediate risk (4.0)
## |   |   |   Systolic > 0.163636
## |   |   |   |   isBlack <= 0
## |   |   |   |   |   Cholesterol <= 0.242857: Low risk (38.0/1.0)
## |   |   |   |   |   Cholesterol > 0.242857
## |   |   |   |   |   |   HDL <= 0.8125
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   Age <= 0.230769: Low risk (31.0)
## |   |   |   |   |   |   |   |   Age > 0.230769
## |   |   |   |   |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   isMale > 0: Borderline risk (12.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   |   |   |   Systolic <= 0.218182: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.218182: Low risk (9.0)
## |   |   |   |   |   |   |   |   Systolic > 0.309091
## |   |   |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   |   |   isHypertensive <= 0: Borderline risk (17.0/1.0)
## |   |   |   |   |   |   |   |   |   |   isHypertensive > 0: Low risk (4.0)
## |   |   |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.9
## |   |   |   |   |   |   |   |   |   |   |   HDL <= 0.4625
## |   |   |   |   |   |   |   |   |   |   |   |   isDiabetic <= 0: Borderline risk (8.0/1.0)
## |   |   |   |   |   |   |   |   |   |   |   |   isDiabetic > 0: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   HDL > 0.4625: Borderline risk (23.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.9: Intermediate risk (2.0)
## |   |   |   |   |   |   HDL > 0.8125
## |   |   |   |   |   |   |   isMale <= 0: Low risk (17.0)
## |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   Age <= 0.076923: Low risk (3.0)
## |   |   |   |   |   |   |   |   Age > 0.076923: Intermediate risk (2.0)
## |   |   |   |   isBlack > 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Systolic <= 0.554545
## |   |   |   |   |   |   |   Systolic <= 0.245455
## |   |   |   |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   |   |   |   isMale > 0: Borderline risk (15.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.245455
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Low risk (20.0/2.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   HDL <= 0.4625: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   |   HDL > 0.4625: Low risk (5.0)
## |   |   |   |   |   |   Systolic > 0.554545
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   |   Age <= 0.153846: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   |   Age > 0.153846: Low risk (6.0)
## |   |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.7
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.718182: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.718182: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.7: Borderline risk (10.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Age <= 0: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Age > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.071429: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.071429
## |   |   |   |   |   |   |   |   |   |   Cholesterol <= 0.871429: Intermediate risk (12.0)
## |   |   |   |   |   |   |   |   |   |   Cholesterol > 0.871429: High risk (3.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   |   HDL <= 0.4375
## |   |   |   |   |   |   |   |   Age <= 0.205128: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.205128: Borderline risk (3.0)
## |   |   |   |   |   |   |   HDL > 0.4375: Low risk (6.0)
## |   |   |   |   |   |   Systolic > 0.309091
## |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   Age <= 0.153846
## |   |   |   |   |   |   |   |   |   Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.881818: High risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.153846: High risk (6.0)
## |   |   Age > 0.282051
## |   |   |   Systolic <= 0.254545
## |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   isHypertensive <= 0: Low risk (20.0)
## |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   Cholesterol <= 0.385714: Low risk (5.0)
## |   |   |   |   |   |   |   Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## |   |   |   |   |   |   isMale > 0: Intermediate risk (6.0/1.0)
## |   |   |   |   isDiabetic > 0
## |   |   |   |   |   Age <= 0.435897
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Systolic <= 0.2: Borderline risk (21.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.2: Low risk (3.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0: Low risk (3.0)
## |   |   |   |   |   Age > 0.435897: Intermediate risk (10.0)
## |   |   |   Systolic > 0.254545
## |   |   |   |   isMale <= 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Age <= 0.384615
## |   |   |   |   |   |   |   HDL <= 0.5125: Intermediate risk (2.0)
## |   |   |   |   |   |   |   HDL > 0.5125: Low risk (12.0)
## |   |   |   |   |   |   Age > 0.384615
## |   |   |   |   |   |   |   Cholesterol <= 0.814286
## |   |   |   |   |   |   |   |   Systolic <= 0.936364
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.414286
## |   |   |   |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   |   |   |   Age <= 0.512821: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   Age > 0.512821: Borderline risk (8.0)
## |   |   |   |   |   |   |   |   |   |   isSmoker > 0: Borderline risk (7.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.414286
## |   |   |   |   |   |   |   |   |   |   Age <= 0.512821: Borderline risk (5.0)
## |   |   |   |   |   |   |   |   |   |   Age > 0.512821: Intermediate risk (3.0)
## |   |   |   |   |   |   |   |   Systolic > 0.936364: Intermediate risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.814286: Low risk (3.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Systolic <= 0.609091
## |   |   |   |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   |   |   |   Age <= 0.333333: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Age > 0.333333: Borderline risk (13.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   isSmoker <= 0: Borderline risk (4.0)
## |   |   |   |   |   |   |   |   |   isSmoker > 0: Intermediate risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.609091
## |   |   |   |   |   |   |   |   isBlack <= 0: Intermediate risk (6.0/1.0)
## |   |   |   |   |   |   |   |   isBlack > 0: High risk (2.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Systolic <= 0.827273
## |   |   |   |   |   |   |   |   isBlack <= 0: Intermediate risk (9.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.814286: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.814286: High risk (2.0)
## |   |   |   |   |   |   |   Systolic > 0.827273: High risk (3.0)
## |   |   |   |   isMale > 0
## |   |   |   |   |   Cholesterol <= 0.914286
## |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   isDiabetic <= 0: Intermediate risk (18.0)
## |   |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (6.0/1.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.435897: Borderline risk (4.0)
## |   |   |   |   |   |   |   |   |   Age > 0.435897: Intermediate risk (2.0)
## |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Systolic <= 0.690909: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.690909: High risk (4.0)
## |   |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.128571: Intermediate risk (4.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.128571: High risk (11.0/1.0)
## |   |   |   |   |   Cholesterol > 0.914286
## |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (2.0)
## |   |   |   |   |   |   isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## |   Systolic <= 0.5
## |   |   isDiabetic <= 0
## |   |   |   HDL <= 0.15
## |   |   |   |   Systolic <= 0.190909
## |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   isMale > 0: Intermediate risk (2.0)
## |   |   |   |   Systolic > 0.190909: High risk (9.0)
## |   |   |   HDL > 0.15
## |   |   |   |   Age <= 0.692308
## |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   Systolic <= 0.172727: Low risk (4.0/1.0)
## |   |   |   |   |   |   Systolic > 0.172727
## |   |   |   |   |   |   |   Age <= 0.589744: Intermediate risk (3.0/1.0)
## |   |   |   |   |   |   |   Age > 0.589744: Borderline risk (22.0)
## |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   Age <= 0.589744: Borderline risk (3.0)
## |   |   |   |   |   |   Age > 0.589744: Intermediate risk (9.0/1.0)
## |   |   |   |   Age > 0.692308
## |   |   |   |   |   HDL <= 0.975
## |   |   |   |   |   |   Systolic <= 0.427273
## |   |   |   |   |   |   |   Cholesterol <= 0.057143
## |   |   |   |   |   |   |   |   Cholesterol <= 0.028571: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.028571: Borderline risk (4.0)
## |   |   |   |   |   |   |   Cholesterol > 0.057143
## |   |   |   |   |   |   |   |   isSmoker <= 0: Intermediate risk (25.0/2.0)
## |   |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.769231: Intermediate risk (4.0)
## |   |   |   |   |   |   |   |   |   Age > 0.769231
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.072727: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.072727: High risk (4.0)
## |   |   |   |   |   |   Systolic > 0.427273: High risk (5.0)
## |   |   |   |   |   HDL > 0.975: Borderline risk (5.0)
## |   |   isDiabetic > 0
## |   |   |   isSmoker <= 0
## |   |   |   |   Systolic <= 0.318182
## |   |   |   |   |   Age <= 0.820513: Intermediate risk (18.0/1.0)
## |   |   |   |   |   Age > 0.820513
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Age <= 0.948718: Intermediate risk (4.0)
## |   |   |   |   |   |   |   Age > 0.948718: High risk (4.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0: High risk (3.0)
## |   |   |   |   Systolic > 0.318182: High risk (10.0/1.0)
## |   |   |   isSmoker > 0
## |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Age <= 0.794872: Intermediate risk (4.0)
## |   |   |   |   |   |   Age > 0.794872: High risk (2.0)
## |   |   |   |   |   isBlack > 0: High risk (4.0)
## |   |   |   |   isHypertensive > 0: High risk (28.0)
## |   Systolic > 0.5
## |   |   Age <= 0.589744
## |   |   |   isDiabetic <= 0: Borderline risk (4.0/1.0)
## |   |   |   isDiabetic > 0: High risk (7.0/1.0)
## |   |   Age > 0.589744: High risk (141.0/7.0)
## 
## Number of Leaves  :  131
## 
## Size of the tree :   261
# Make predictions using the J48 model on the test data
testPred <- predict(C45Fit, newdata = testData)

# Create a confusion matrix
conf_matrix <- table(testPred, testData$Risk)

# Display the confusion matrix
print(conf_matrix)
##                    
## testPred            Low risk Borderline risk Intermediate risk High risk
##   Low risk                58               2                 8         2
##   Borderline risk          8              90                 6         2
##   Intermediate risk        9               0                42        17
##   High risk                0               0                17        55
# Calculate performance metrics
accuracy_G3 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_G3 <-( 1 - accuracy_G3)
sensitivity_G3 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_G3 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_G3 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])

accuracy <- sum(testPred == testData$Risk) / length(testPred)
# Display performance metrics
cat("Accuracy: ", accuracy_G3, "\n")
## Accuracy:  0.7753165
cat("Error Rate: ", error_rate_G3, "\n")
## Error Rate:  0.2246835
cat("Sensitivity (Recall): ", sensitivity_G3, "\n")
## Sensitivity (Recall):  0.7638889
cat("Specificity: ", specificity_G3, "\n")
## Specificity:  0.7786885
cat("Precision: ", precision_G3, "\n")
## Precision:  0.7236842

Analysis:

The C4.5 decision tree, employing the gain ratio criterion, demonstrates a commendable accuracy of 81.%. With a substantial tree size of 305 and 153 leaves, the model captures nuanced relationships within the dataset. Its predictive prowess is evident in the balanced sensitivity (78.26%) and specificity (81.78%), highlighting its ability to correctly identify both positive and negative instances. The precision of 71.05% emphasizes the accuracy of positive predictions. This collectively positions the C4.5 decision tree as a robust and effective choice for classification on our dataset, showcasing its capability to achieve high accuracy and reliable predictions.

  • Age is a critical initial factor. Individuals younger than the threshold value are further assessed for HDL and systolic blood pressure levels.
  • Younger individuals (Age ≤ 56.41%) with lower HDL (≤ 22.5%) and systolic blood pressure (≤ 54.5455%) who are not diabetic tend to be at low cardiovascular risk.
  • For those above the age threshold of 56.41%, higher systolic blood pressure significantly increases risk, with other factors like smoking and diabetes further escalating the risk to high.

After we have created a decision tree using the Gain ratio of three different sizes, we will now calculate the comparison between the three models

# Create data frames for each model's summary
summary_c4.5_1 <- data.frame(
  Model = "60% training, 40% testing",
  Accuracy = 78.38,
  Sensitivity = 73.72,
  Specificity = 79.92,
  Precision = 74.19
)

summary_c4.5_2 <- data.frame(
  Model = "70% training, 30% testing",
  Accuracy = 79.39,
  Sensitivity = 78.0,
  Specificity = 79.78,
  Precision = 72.90
)

summary_c4.5_3 <- data.frame(
  Model = "80% training, 20% testing",
  Accuracy = 81.01,
  Sensitivity = 78.26,
  Specificity = 81.78,
  Precision = 71.05
)

# Combine the summaries into a single data frame
comparison_table <- rbind(summary_c4.5_1, summary_c4.5_2, summary_c4.5_3)

# Print the comparison table
print(comparison_table)
##                       Model Accuracy Sensitivity Specificity Precision
## 1 60% training, 40% testing    78.38       73.72       79.92     74.19
## 2 70% training, 30% testing    79.39       78.00       79.78     72.90
## 3 80% training, 20% testing    81.01       78.26       81.78     71.05

In our exploration of decision tree models—specifically, C4.5 with varying numbers of training-testing —we aimed to identify the optimal configuration for accurate and reliable predictions. The results indicate that the model with (80% training, 20% testing) stands out, achieving the highest accuracy at 81.01%. This particular configuration strikes a balance between sensitivity (78.26%), specificity (81.78%), and precision (71.05%), making it a robust choice for the classification task at hand.

It’s noteworthy that the model with (70% training, 30% testing) also performs well, showcasing competitive accuracy (79.39%) and a balanced trade-off between sensitivity and specificity. However, the model with (60% training, 40% testing) surpasses it, demonstrating superior sensitivity and precision.

In contrast, the model with (60% training, 40% testing), while achieving a respectable accuracy of 78.38%, exhibits slightly lower sensitivity and precision. This suggests that a more complex tree structure, as seen in the model with (80% training, 20% testing), contributes to better capturing the underlying patterns in the data.

In conclusion, the C4.5 decision tree with (80% training, 20% testing) emerges as the preferred model for this specific dataset and classification task. Its superior performance in terms of accuracy, sensitivity, specificity, and precision underscores its suitability for making reliable predictions.

Decision tree using Information gain

-For the construction of our decision tree model, we have opted for the C5.0 algorithm, a sophisticated and versatile tool known for its proficiency in handling classification tasks. Specifically, we harness the power of information gain as the guiding criterion within C5.0. This choice is deliberate, as information gain allows the algorithm to discern the most relevant and discriminative features in our dataset, facilitating the creation of a decision tree that excels at capturing intricate patterns and relationships.

1-partition the data into ( 60% training, 40% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)

# Define the formula
myFormula <- Risk ~ .

# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)

# Display a summary of the decision tree
print(c50_model)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## Classification Tree
## Number of samples: 959 
## Number of predictors: 9 
## 
## Tree size: 105 
## 
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)

print(C45Fit)
## J48 pruned tree
## ------------------
## 
## Age <= 0.564103
## |   HDL <= 0.225
## |   |   Systolic <= 0.545455
## |   |   |   isDiabetic <= 0
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   Age <= 0.25641: Low risk (13.0)
## |   |   |   |   |   Age > 0.25641
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Cholesterol <= 0.257143: Intermediate risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.257143: Borderline risk (12.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Systolic <= 0.081818: Low risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.081818
## |   |   |   |   |   |   |   |   Age <= 0.538462: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   Age > 0.538462: Borderline risk (2.0)
## |   |   |   |   isSmoker > 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Systolic <= 0.309091: Borderline risk (9.0/1.0)
## |   |   |   |   |   |   Systolic > 0.309091: Intermediate risk (2.0)
## |   |   |   |   |   isBlack > 0: Intermediate risk (13.0/1.0)
## |   |   |   isDiabetic > 0
## |   |   |   |   Age <= 0.410256
## |   |   |   |   |   Cholesterol <= 0.357143: Intermediate risk (8.0/1.0)
## |   |   |   |   |   Cholesterol > 0.357143
## |   |   |   |   |   |   isHypertensive <= 0: Borderline risk (17.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Cholesterol <= 0.685714
## |   |   |   |   |   |   |   |   isSmoker <= 0: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   isSmoker > 0: High risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.685714: Intermediate risk (4.0)
## |   |   |   |   Age > 0.410256
## |   |   |   |   |   Cholesterol <= 0.271429: Low risk (3.0/1.0)
## |   |   |   |   |   Cholesterol > 0.271429: Intermediate risk (12.0/1.0)
## |   |   Systolic > 0.545455
## |   |   |   Cholesterol <= 0.014286: Borderline risk (5.0/1.0)
## |   |   |   Cholesterol > 0.014286
## |   |   |   |   isSmoker <= 0
## |   |   |   |   |   isDiabetic <= 0: Intermediate risk (12.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   isMale <= 0: Intermediate risk (4.0/1.0)
## |   |   |   |   |   |   isMale > 0: High risk (5.0)
## |   |   |   |   isSmoker > 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Cholesterol <= 0.242857: Intermediate risk (2.0)
## |   |   |   |   |   |   Cholesterol > 0.242857: High risk (13.0/2.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   HDL <= 0.2: High risk (17.0)
## |   |   |   |   |   |   HDL > 0.2: Intermediate risk (3.0/1.0)
## |   HDL > 0.225
## |   |   Age <= 0.282051
## |   |   |   Systolic <= 0.163636
## |   |   |   |   isBlack <= 0: Low risk (44.0)
## |   |   |   |   isBlack > 0
## |   |   |   |   |   isMale <= 0: Low risk (9.0)
## |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   Systolic <= 0.090909: Low risk (6.0)
## |   |   |   |   |   |   Systolic > 0.090909: Intermediate risk (4.0)
## |   |   |   Systolic > 0.163636
## |   |   |   |   isBlack <= 0
## |   |   |   |   |   Cholesterol <= 0.242857: Low risk (38.0/1.0)
## |   |   |   |   |   Cholesterol > 0.242857
## |   |   |   |   |   |   HDL <= 0.8125
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   Age <= 0.230769: Low risk (31.0)
## |   |   |   |   |   |   |   |   Age > 0.230769
## |   |   |   |   |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   isMale > 0: Borderline risk (12.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   |   |   |   Systolic <= 0.218182: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.218182: Low risk (9.0)
## |   |   |   |   |   |   |   |   Systolic > 0.309091
## |   |   |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   |   |   isHypertensive <= 0: Borderline risk (17.0/1.0)
## |   |   |   |   |   |   |   |   |   |   isHypertensive > 0: Low risk (4.0)
## |   |   |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.9
## |   |   |   |   |   |   |   |   |   |   |   HDL <= 0.4625
## |   |   |   |   |   |   |   |   |   |   |   |   isDiabetic <= 0: Borderline risk (8.0/1.0)
## |   |   |   |   |   |   |   |   |   |   |   |   isDiabetic > 0: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   HDL > 0.4625: Borderline risk (23.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.9: Intermediate risk (2.0)
## |   |   |   |   |   |   HDL > 0.8125
## |   |   |   |   |   |   |   isMale <= 0: Low risk (17.0)
## |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   Age <= 0.076923: Low risk (3.0)
## |   |   |   |   |   |   |   |   Age > 0.076923: Intermediate risk (2.0)
## |   |   |   |   isBlack > 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Systolic <= 0.554545
## |   |   |   |   |   |   |   Systolic <= 0.245455
## |   |   |   |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   |   |   |   isMale > 0: Borderline risk (15.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.245455
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Low risk (20.0/2.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   HDL <= 0.4625: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   |   HDL > 0.4625: Low risk (5.0)
## |   |   |   |   |   |   Systolic > 0.554545
## |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   |   |   Age <= 0.153846: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   |   Age > 0.153846: Low risk (6.0)
## |   |   |   |   |   |   |   |   isMale > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.7
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.718182: Borderline risk (3.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.718182: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.7: Borderline risk (10.0)
## |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   Age <= 0: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Age > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.071429: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.071429
## |   |   |   |   |   |   |   |   |   |   Cholesterol <= 0.871429: Intermediate risk (12.0)
## |   |   |   |   |   |   |   |   |   |   Cholesterol > 0.871429: High risk (3.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   Systolic <= 0.309091
## |   |   |   |   |   |   |   HDL <= 0.4375
## |   |   |   |   |   |   |   |   Age <= 0.205128: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.205128: Borderline risk (3.0)
## |   |   |   |   |   |   |   HDL > 0.4375: Low risk (6.0)
## |   |   |   |   |   |   Systolic > 0.309091
## |   |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.314286: Borderline risk (5.0/1.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.314286: Intermediate risk (10.0/2.0)
## |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   Age <= 0.153846
## |   |   |   |   |   |   |   |   |   Systolic <= 0.881818: Intermediate risk (9.0/1.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.881818: High risk (2.0)
## |   |   |   |   |   |   |   |   Age > 0.153846: High risk (6.0)
## |   |   Age > 0.282051
## |   |   |   Systolic <= 0.254545
## |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   isHypertensive <= 0: Low risk (20.0)
## |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   isMale <= 0
## |   |   |   |   |   |   |   Cholesterol <= 0.385714: Low risk (5.0)
## |   |   |   |   |   |   |   Cholesterol > 0.385714: Borderline risk (15.0/1.0)
## |   |   |   |   |   |   isMale > 0: Intermediate risk (6.0/1.0)
## |   |   |   |   isDiabetic > 0
## |   |   |   |   |   Age <= 0.435897
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Systolic <= 0.2: Borderline risk (21.0/1.0)
## |   |   |   |   |   |   |   Systolic > 0.2: Low risk (3.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0: Low risk (3.0)
## |   |   |   |   |   Age > 0.435897: Intermediate risk (10.0)
## |   |   |   Systolic > 0.254545
## |   |   |   |   isMale <= 0
## |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   Age <= 0.384615
## |   |   |   |   |   |   |   HDL <= 0.5125: Intermediate risk (2.0)
## |   |   |   |   |   |   |   HDL > 0.5125: Low risk (12.0)
## |   |   |   |   |   |   Age > 0.384615
## |   |   |   |   |   |   |   Cholesterol <= 0.814286
## |   |   |   |   |   |   |   |   Systolic <= 0.936364
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.414286
## |   |   |   |   |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   |   |   |   |   Age <= 0.512821: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   |   |   Age > 0.512821: Borderline risk (8.0)
## |   |   |   |   |   |   |   |   |   |   isSmoker > 0: Borderline risk (7.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.414286
## |   |   |   |   |   |   |   |   |   |   Age <= 0.512821: Borderline risk (5.0)
## |   |   |   |   |   |   |   |   |   |   Age > 0.512821: Intermediate risk (3.0)
## |   |   |   |   |   |   |   |   Systolic > 0.936364: Intermediate risk (2.0)
## |   |   |   |   |   |   |   Cholesterol > 0.814286: Low risk (3.0/1.0)
## |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Systolic <= 0.609091
## |   |   |   |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   |   |   |   Age <= 0.333333: Low risk (2.0)
## |   |   |   |   |   |   |   |   |   Age > 0.333333: Borderline risk (13.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   isSmoker <= 0: Borderline risk (4.0)
## |   |   |   |   |   |   |   |   |   isSmoker > 0: Intermediate risk (3.0)
## |   |   |   |   |   |   |   Systolic > 0.609091
## |   |   |   |   |   |   |   |   isBlack <= 0: Intermediate risk (6.0/1.0)
## |   |   |   |   |   |   |   |   isBlack > 0: High risk (2.0)
## |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   Systolic <= 0.827273
## |   |   |   |   |   |   |   |   isBlack <= 0: Intermediate risk (9.0)
## |   |   |   |   |   |   |   |   isBlack > 0
## |   |   |   |   |   |   |   |   |   Cholesterol <= 0.814286: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   |   Cholesterol > 0.814286: High risk (2.0)
## |   |   |   |   |   |   |   Systolic > 0.827273: High risk (3.0)
## |   |   |   |   isMale > 0
## |   |   |   |   |   Cholesterol <= 0.914286
## |   |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   |   isDiabetic <= 0: Intermediate risk (18.0)
## |   |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (6.0/1.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.435897: Borderline risk (4.0)
## |   |   |   |   |   |   |   |   |   Age > 0.435897: Intermediate risk (2.0)
## |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   isDiabetic <= 0
## |   |   |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   isHypertensive > 0
## |   |   |   |   |   |   |   |   |   Systolic <= 0.690909: Intermediate risk (7.0)
## |   |   |   |   |   |   |   |   |   Systolic > 0.690909: High risk (4.0)
## |   |   |   |   |   |   |   isDiabetic > 0
## |   |   |   |   |   |   |   |   Cholesterol <= 0.128571: Intermediate risk (4.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.128571: High risk (11.0/1.0)
## |   |   |   |   |   Cholesterol > 0.914286
## |   |   |   |   |   |   isHypertensive <= 0: Intermediate risk (2.0)
## |   |   |   |   |   |   isHypertensive > 0: Borderline risk (7.0)
## Age > 0.564103
## |   Systolic <= 0.5
## |   |   isDiabetic <= 0
## |   |   |   HDL <= 0.15
## |   |   |   |   Systolic <= 0.190909
## |   |   |   |   |   isMale <= 0: Low risk (2.0)
## |   |   |   |   |   isMale > 0: Intermediate risk (2.0)
## |   |   |   |   Systolic > 0.190909: High risk (9.0)
## |   |   |   HDL > 0.15
## |   |   |   |   Age <= 0.692308
## |   |   |   |   |   isSmoker <= 0
## |   |   |   |   |   |   Systolic <= 0.172727: Low risk (4.0/1.0)
## |   |   |   |   |   |   Systolic > 0.172727
## |   |   |   |   |   |   |   Age <= 0.589744: Intermediate risk (3.0/1.0)
## |   |   |   |   |   |   |   Age > 0.589744: Borderline risk (22.0)
## |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   Age <= 0.589744: Borderline risk (3.0)
## |   |   |   |   |   |   Age > 0.589744: Intermediate risk (9.0/1.0)
## |   |   |   |   Age > 0.692308
## |   |   |   |   |   HDL <= 0.975
## |   |   |   |   |   |   Systolic <= 0.427273
## |   |   |   |   |   |   |   Cholesterol <= 0.057143
## |   |   |   |   |   |   |   |   Cholesterol <= 0.028571: Intermediate risk (5.0)
## |   |   |   |   |   |   |   |   Cholesterol > 0.028571: Borderline risk (4.0)
## |   |   |   |   |   |   |   Cholesterol > 0.057143
## |   |   |   |   |   |   |   |   isSmoker <= 0: Intermediate risk (25.0/2.0)
## |   |   |   |   |   |   |   |   isSmoker > 0
## |   |   |   |   |   |   |   |   |   Age <= 0.769231: Intermediate risk (4.0)
## |   |   |   |   |   |   |   |   |   Age > 0.769231
## |   |   |   |   |   |   |   |   |   |   Systolic <= 0.072727: Intermediate risk (2.0)
## |   |   |   |   |   |   |   |   |   |   Systolic > 0.072727: High risk (4.0)
## |   |   |   |   |   |   Systolic > 0.427273: High risk (5.0)
## |   |   |   |   |   HDL > 0.975: Borderline risk (5.0)
## |   |   isDiabetic > 0
## |   |   |   isSmoker <= 0
## |   |   |   |   Systolic <= 0.318182
## |   |   |   |   |   Age <= 0.820513: Intermediate risk (18.0/1.0)
## |   |   |   |   |   Age > 0.820513
## |   |   |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   |   |   Age <= 0.948718: Intermediate risk (4.0)
## |   |   |   |   |   |   |   Age > 0.948718: High risk (4.0/1.0)
## |   |   |   |   |   |   isHypertensive > 0: High risk (3.0)
## |   |   |   |   Systolic > 0.318182: High risk (10.0/1.0)
## |   |   |   isSmoker > 0
## |   |   |   |   isHypertensive <= 0
## |   |   |   |   |   isBlack <= 0
## |   |   |   |   |   |   Age <= 0.794872: Intermediate risk (4.0)
## |   |   |   |   |   |   Age > 0.794872: High risk (2.0)
## |   |   |   |   |   isBlack > 0: High risk (4.0)
## |   |   |   |   isHypertensive > 0: High risk (28.0)
## |   Systolic > 0.5
## |   |   Age <= 0.589744
## |   |   |   isDiabetic <= 0: Borderline risk (4.0/1.0)
## |   |   |   isDiabetic > 0: High risk (7.0/1.0)
## |   |   Age > 0.589744: High risk (141.0/7.0)
## 
## Number of Leaves  :  131
## 
## Size of the tree :   261
# Calculate performance metrics
accuracy_I1 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I1 <-( 1 - accuracy_I1)
sensitivity_I1 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I1 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I1 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])

# Display performance metrics
cat("Accuracy: ", accuracy_I1, "\n")
## Accuracy:  0.7753165
cat("Error Rate: ", error_rate_I1, "\n")
## Error Rate:  0.2246835
cat("Sensitivity (Recall): ", sensitivity_I1, "\n")
## Sensitivity (Recall):  0.7638889
cat("Specificity: ", specificity_I1, "\n")
## Specificity:  0.7786885
cat("Precision: ", precision_I1, "\n")
## Precision:  0.7236842
# Display a summary of the decision tree
summary(c50_model)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Dec  2 16:40:38 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 959 cases (10 attributes) from undefined.data
## 
## Decision tree:
## 
## Age <= 0.5641026:
## :...HDL <= 0.225:
## :   :...Systolic > 0.5545455:
## :   :   :...Cholesterol <= 0.01428571: Borderline risk (5/1)
## :   :   :   Cholesterol > 0.01428571:
## :   :   :   :...isSmoker > 0: High risk (29/6)
## :   :   :       isSmoker <= 0:
## :   :   :       :...Systolic <= 0.6727273: High risk (3)
## :   :   :           Systolic > 0.6727273:
## :   :   :           :...Age <= 0.4615385: Intermediate risk (9)
## :   :   :               Age > 0.4615385: High risk (3/1)
## :   :   Systolic <= 0.5545455:
## :   :   :...isHypertensive > 0:
## :   :       :...isDiabetic > 0:
## :   :       :   :...isMale <= 0: Intermediate risk (11/1)
## :   :       :   :   isMale > 0:
## :   :       :   :   :...isBlack <= 0: High risk (3/1)
## :   :       :   :       isBlack > 0: Intermediate risk (4/1)
## :   :       :   isDiabetic <= 0:
## :   :       :   :...Systolic > 0.3090909: Intermediate risk (10)
## :   :       :       Systolic <= 0.3090909:
## :   :       :       :...Systolic > 0.1909091: Borderline risk (5/1)
## :   :       :           Systolic <= 0.1909091:
## :   :       :           :...isMale <= 0: Low risk (6)
## :   :       :               isMale > 0: Intermediate risk (3)
## :   :       isHypertensive <= 0:
## :   :       :...Age <= 0.02564103: Low risk (6)
## :   :           Age > 0.02564103:
## :   :           :...HDL <= 0.0125: Intermediate risk (6)
## :   :               HDL > 0.0125:
## :   :               :...Cholesterol <= 0.2:
## :   :                   :...Systolic <= 0.2909091: Low risk (3)
## :   :                   :   Systolic > 0.2909091: Intermediate risk (3)
## :   :                   Cholesterol > 0.2:
## :   :                   :...Cholesterol > 0.8714285: Low risk (2/1)
## :   :                       Cholesterol <= 0.8714285:
## :   :                       :...Age <= 0.05128205: Intermediate risk (2)
## :   :                           Age > 0.05128205: Borderline risk (30/4)
## :   HDL > 0.225:
## :   :...Age <= 0.2820513:
## :       :...isBlack <= 0:
## :       :   :...Cholesterol > 0.5571429:
## :       :   :   :...Systolic <= 0.1636364: Low risk (5)
## :       :   :   :   Systolic > 0.1636364:
## :       :   :   :   :...isSmoker <= 0:
## :       :   :   :       :...Age <= 0.2307692: Low risk (15)
## :       :   :   :       :   Age > 0.2307692: Borderline risk (8/1)
## :       :   :   :       isSmoker > 0:
## :       :   :   :       :...HDL <= 0.7375: Borderline risk (33/4)
## :       :   :   :           HDL > 0.7375: Low risk (8/1)
## :       :   :   Cholesterol <= 0.5571429:
## :       :   :   :...Systolic <= 0.7181818: Low risk (78)
## :       :   :       Systolic > 0.7181818:
## :       :   :       :...isDiabetic <= 0: Low risk (18/1)
## :       :   :           isDiabetic > 0:
## :       :   :           :...HDL > 0.6375: Borderline risk (11)
## :       :   :               HDL <= 0.6375:
## :       :   :               :...Systolic <= 0.9: Low risk (5)
## :       :   :                   Systolic > 0.9: Intermediate risk (2)
## :       :   isBlack > 0:
## :       :   :...Systolic <= 0.5363637:
## :       :       :...isMale <= 0:
## :       :       :   :...Cholesterol <= 0.8285714: Low risk (30/1)
## :       :       :   :   Cholesterol > 0.8285714: Borderline risk (2)
## :       :       :   isMale > 0:
## :       :       :   :...isDiabetic > 0:
## :       :       :       :...Systolic <= 0.07272727: Low risk (4)
## :       :       :       :   Systolic > 0.07272727: Intermediate risk (9/1)
## :       :       :       isDiabetic <= 0:
## :       :       :       :...isSmoker <= 0:
## :       :       :           :...isHypertensive <= 0: Low risk (9)
## :       :       :           :   isHypertensive > 0: Borderline risk (6/1)
## :       :       :           isSmoker > 0:
## :       :       :           :...Age <= 0.1794872: Borderline risk (12/1)
## :       :       :               Age > 0.1794872: Intermediate risk (2)
## :       :       Systolic > 0.5363637:
## :       :       :...isHypertensive <= 0:
## :       :           :...Age <= 0.2051282:
## :       :           :   :...isMale <= 0:
## :       :           :   :   :...Age <= 0.1282051: Borderline risk (5)
## :       :           :   :   :   Age > 0.1282051: Low risk (6/1)
## :       :           :   :   isMale > 0:
## :       :           :   :   :...Cholesterol <= 0.6857143: Intermediate risk (5/1)
## :       :           :   :       Cholesterol > 0.6857143: Borderline risk (8)
## :       :           :   Age > 0.2051282:
## :       :           :   :...isSmoker > 0: Intermediate risk (4)
## :       :           :       isSmoker <= 0:
## :       :           :       :...Age <= 0.2564103: Intermediate risk (2)
## :       :           :           Age > 0.2564103: Low risk (2)
## :       :           isHypertensive > 0:
## :       :           :...Systolic > 0.8909091: High risk (7)
## :       :               Systolic <= 0.8909091:
## :       :               :...Age <= 0.07692308:
## :       :                   :...HDL <= 0.5625: Intermediate risk (3)
## :       :                   :   HDL > 0.5625: Borderline risk (7)
## :       :                   Age > 0.07692308:
## :       :                   :...Age <= 0.1794872: Intermediate risk (7)
## :       :                       Age > 0.1794872:
## :       :                       :...isDiabetic <= 0: Intermediate risk (4/1)
## :       :                           isDiabetic > 0: High risk (2)
## :       Age > 0.2820513:
## :       :...Systolic > 0.7090909:
## :           :...Systolic <= 0.9:
## :           :   :...isSmoker <= 0: Intermediate risk (12)
## :           :   :   isSmoker > 0:
## :           :   :   :...Age <= 0.3846154: Intermediate risk (6)
## :           :   :       Age > 0.3846154: High risk (7/1)
## :           :   Systolic > 0.9:
## :           :   :...isDiabetic > 0: High risk (4)
## :           :       isDiabetic <= 0:
## :           :       :...Systolic <= 0.9363636: Borderline risk (7)
## :           :           Systolic > 0.9363636: Intermediate risk (5/1)
## :           Systolic <= 0.7090909:
## :           :...isDiabetic <= 0:
## :               :...isMale > 0:
## :               :   :...Systolic > 0.6636364: Borderline risk (7)
## :               :   :   Systolic <= 0.6636364:
## :               :   :   :...isHypertensive > 0: Intermediate risk (18)
## :               :   :       isHypertensive <= 0:
## :               :   :       :...isSmoker <= 0: Low risk (5)
## :               :   :           isSmoker > 0: Intermediate risk (2)
## :               :   isMale <= 0:
## :               :   :...Age <= 0.4871795:
## :               :       :...Systolic <= 0.3818182: Low risk (19)
## :               :       :   Systolic > 0.3818182:
## :               :       :   :...HDL <= 0.55: Borderline risk (3)
## :               :       :       HDL > 0.55: Low risk (10/1)
## :               :       Age > 0.4871795:
## :               :       :...Cholesterol <= 0.3: Low risk (3)
## :               :           Cholesterol > 0.3:
## :               :           :...Systolic <= 0.3636364: Borderline risk (13)
## :               :               Systolic > 0.3636364:
## :               :               :...Cholesterol <= 0.4142857: Borderline risk (2)
## :               :                   Cholesterol > 0.4142857: Intermediate risk (2)
## :               isDiabetic > 0:
## :               :...Age > 0.4615385:
## :                   :...Cholesterol <= 0.3285714: Borderline risk (2)
## :                   :   Cholesterol > 0.3285714: Intermediate risk (19/1)
## :                   Age <= 0.4615385:
## :                   :...isSmoker > 0:
## :                       :...Cholesterol <= 0.5571429:
## :                       :   :...isHypertensive <= 0: Borderline risk (14/2)
## :                       :   :   isHypertensive > 0: High risk (2)
## :                       :   Cholesterol > 0.5571429:
## :                       :   :...isBlack <= 0: Intermediate risk (3)
## :                       :       isBlack > 0: High risk (3/1)
## :                       isSmoker <= 0:
## :                       :...isMale <= 0:
## :                           :...isHypertensive <= 0:
## :                           :   :...HDL <= 0.675: Borderline risk (8/1)
## :                           :   :   HDL > 0.675: Low risk (4)
## :                           :   isHypertensive > 0:
## :                           :   :...Systolic <= 0.2909091: Low risk (2)
## :                           :       Systolic > 0.2909091: Intermediate risk (2)
## :                           isMale > 0:
## :                           :...Systolic <= 0.07272727: Borderline risk (12)
## :                               Systolic > 0.07272727: [S1]
## Age > 0.5641026:
## :...Systolic > 0.5: High risk (128/10)
##     Systolic <= 0.5:
##     :...isDiabetic > 0:
##         :...isSmoker > 0:
##         :   :...isHypertensive > 0: High risk (22)
##         :   :   isHypertensive <= 0:
##         :   :   :...Age <= 0.7948718: Intermediate risk (5/1)
##         :   :       Age > 0.7948718: High risk (4)
##         :   isSmoker <= 0:
##         :   :...Systolic > 0.3181818: High risk (9/1)
##         :       Systolic <= 0.3181818:
##         :       :...Age > 0.9230769: High risk (4)
##         :           Age <= 0.9230769:
##         :           :...isHypertensive <= 0: Intermediate risk (8)
##         :               isHypertensive > 0:
##         :               :...Age <= 0.8205128: Intermediate risk (11/1)
##         :                   Age > 0.8205128: High risk (2)
##         isDiabetic <= 0:
##         :...HDL <= 0.15:
##             :...Systolic > 0.1909091: High risk (9)
##             :   Systolic <= 0.1909091:
##             :   :...isMale <= 0: Low risk (2)
##             :       isMale > 0: Intermediate risk (2)
##             HDL > 0.15:
##             :...Systolic > 0.4272727:
##                 :...Age <= 0.7692308: Borderline risk (5)
##                 :   Age > 0.7692308: High risk (5)
##                 Systolic <= 0.4272727:
##                 :...Cholesterol > 0.7142857:
##                     :...Age <= 0.8974359: Intermediate risk (12)
##                     :   Age > 0.8974359:
##                     :   :...Systolic <= 0.2090909: Intermediate risk (3/1)
##                     :       Systolic > 0.2090909: High risk (3)
##                     Cholesterol <= 0.7142857:
##                     :...Systolic > 0.2909091: Intermediate risk (15/1)
##                         Systolic <= 0.2909091:
##                         :...isHypertensive > 0:
##                             :...Systolic > 0.1727273: Borderline risk (7/1)
##                             :   Systolic <= 0.1727273:
##                             :   :...Age <= 0.6923077: Low risk (3)
##                             :       Age > 0.6923077: Intermediate risk (3)
##                             isHypertensive <= 0:
##                             :...HDL <= 0.6: Intermediate risk (7)
##                                 HDL > 0.6:
##                                 :...Cholesterol <= 0.3714286: Intermediate risk (3)
##                                     Cholesterol > 0.3714286:
##                                     :...Systolic <= 0.05454545: Intermediate risk (2)
##                                         Systolic > 0.05454545: Borderline risk (18)
## 
## SubTree [S1]
## 
## isHypertensive <= 0: Intermediate risk (5)
## isHypertensive > 0: Borderline risk (4)
## 
## 
## Evaluation on training data (959 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##     105   55( 5.7%)   <<
## 
## 
##     (a)   (b)   (c)   (d)    <-classified as
##    ----  ----  ----  ----
##     239     7                (a): class Low risk
##       1   217           3    (b): class Borderline risk
##       4     8   220    18    (c): class Intermediate risk
##       1     2    11   228    (d): class High risk
## 
## 
##  Attribute usage:
## 
##  100.00% Age
##  100.00% Systolic
##   79.87% HDL
##   49.01% isDiabetic
##   47.55% Cholesterol
##   34.62% isBlack
##   34.62% isHypertensive
##   31.39% isSmoker
##   26.07% isMale
## 
## 
## Time: 0.0 secs

Analysis: The C5 model demonstrates strong predictive capabilities with an accuracy of 78.37%. It effectively identifies instances of low risk (sensitivity of 80.6%) and maintains high specificity (77.6%) in recognizing non-low-risk instances. The precision of 72.26% highlights the accuracy of positive predictions. The model’s tree structure, comprising 120 nodes, reflects its complexity in capturing patterns within the data. These results suggest a well-balanced model with the potential for reliable classification across multiple risk categories.

  • Age plays a critical role as it is the root in risk determination. For individuals with an age value of 0.5641026 or less, the risk varies based on other factors.

  • HDL cholesterol and systolic blood pressure are the next important attributes, with lower HDL and higher systolic values generally increasing the risk classification.

  • Cholesterol levels are used to further stratify risk, especially when combined with smoking status and systolic blood pressure measurements.

  • For individuals who are hypertensive or diabetic, the risk of being classified as ‘High’ increases, particularly if they are also male and non-black.

2-partition the data into ( 70% training, 30% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)

# Define the formula
myFormula <- Risk ~ .

# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)

# Display a summary of the decision tree
print(c50_model)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## Classification Tree
## Number of samples: 1132 
## Number of predictors: 9 
## 
## Tree size: 135 
## 
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)


# Calculate performance metrics
accuracy_I2 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I2 <-( 1 - accuracy_I2)
sensitivity_I2 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I2 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I2 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])

# Display performance metrics
cat("Accuracy: ", accuracy_I2, "\n")
## Accuracy:  0.7753165
cat("Error Rate: ", error_rate_I2, "\n")
## Error Rate:  0.2246835
cat("Sensitivity (Recall): ", sensitivity_I2, "\n")
## Sensitivity (Recall):  0.7638889
cat("Specificity: ", specificity_I2, "\n")
## Specificity:  0.7786885
cat("Precision: ", precision_I2, "\n")
## Precision:  0.7236842
# Display a summary of the decision tree
summary(c50_model)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Dec  2 16:40:38 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 1132 cases (10 attributes) from undefined.data
## 
## Decision tree:
## 
## Age > 0.5641026:
## :...Systolic > 0.5:
## :   :...Age > 0.5897436: High risk (141/7)
## :   :   Age <= 0.5897436:
## :   :   :...isDiabetic <= 0: Borderline risk (4/1)
## :   :       isDiabetic > 0: High risk (7/1)
## :   Systolic <= 0.5:
## :   :...isDiabetic > 0:
## :       :...isSmoker > 0:
## :       :   :...isHypertensive > 0: High risk (28)
## :       :   :   isHypertensive <= 0:
## :       :   :   :...isBlack > 0: High risk (4)
## :       :   :       isBlack <= 0:
## :       :   :       :...Age <= 0.7948718: Intermediate risk (4)
## :       :   :           Age > 0.7948718: High risk (2)
## :       :   isSmoker <= 0:
## :       :   :...Systolic > 0.3181818: High risk (10/1)
## :       :       Systolic <= 0.3181818:
## :       :       :...Age <= 0.8205128: Intermediate risk (18/1)
## :       :           Age > 0.8205128:
## :       :           :...isHypertensive > 0: High risk (3)
## :       :               isHypertensive <= 0:
## :       :               :...Age <= 0.948718: Intermediate risk (4)
## :       :                   Age > 0.948718: High risk (4/1)
## :       isDiabetic <= 0:
## :       :...HDL <= 0.15:
## :           :...Systolic > 0.1909091: High risk (9)
## :           :   Systolic <= 0.1909091:
## :           :   :...isMale <= 0: Low risk (2)
## :           :       isMale > 0: Intermediate risk (2)
## :           HDL > 0.15:
## :           :...Age <= 0.6923077:
## :               :...isSmoker > 0:
## :               :   :...Age <= 0.5897436: Borderline risk (3)
## :               :   :   Age > 0.5897436: Intermediate risk (9/1)
## :               :   isSmoker <= 0:
## :               :   :...Systolic <= 0.1727273: Low risk (4/1)
## :               :       Systolic > 0.1727273:
## :               :       :...Age <= 0.5897436: Intermediate risk (3/1)
## :               :           Age > 0.5897436: Borderline risk (22)
## :               Age > 0.6923077:
## :               :...HDL > 0.975: Borderline risk (5)
## :                   HDL <= 0.975:
## :                   :...Systolic > 0.4272727: High risk (5)
## :                       Systolic <= 0.4272727:
## :                       :...Cholesterol <= 0.05714286:
## :                           :...Cholesterol <= 0.02857143: Intermediate risk (5)
## :                           :   Cholesterol > 0.02857143: Borderline risk (4)
## :                           Cholesterol > 0.05714286:
## :                           :...isSmoker <= 0:
## :                               :...Age <= 0.9230769: Intermediate risk (19)
## :                               :   Age > 0.9230769:
## :                               :   :...isMale <= 0: Intermediate risk (4)
## :                               :       isMale > 0: High risk (2)
## :                               isSmoker > 0:
## :                               :...Age <= 0.7692308: Intermediate risk (4)
## :                                   Age > 0.7692308:
## :                                   :...Systolic <= 0.07272727: Intermediate risk (2)
## :                                       Systolic > 0.07272727: High risk (4)
## Age <= 0.5641026:
## :...HDL <= 0.225:
##     :...Systolic > 0.5545455:
##     :   :...Cholesterol <= 0.01428571: Borderline risk (5/1)
##     :   :   Cholesterol > 0.01428571:
##     :   :   :...isSmoker <= 0:
##     :   :       :...isDiabetic <= 0: Intermediate risk (12/1)
##     :   :       :   isDiabetic > 0:
##     :   :       :   :...isMale <= 0: Intermediate risk (4/1)
##     :   :       :       isMale > 0: High risk (5)
##     :   :       isSmoker > 0:
##     :   :       :...isDiabetic <= 0:
##     :   :           :...Cholesterol <= 0.2428571: Intermediate risk (2)
##     :   :           :   Cholesterol > 0.2428571: High risk (13/2)
##     :   :           isDiabetic > 0:
##     :   :           :...HDL <= 0.2: High risk (17)
##     :   :               HDL > 0.2: Intermediate risk (3/1)
##     :   Systolic <= 0.5545455:
##     :   :...isDiabetic <= 0:
##     :       :...isSmoker > 0:
##     :       :   :...isBlack > 0: Intermediate risk (13/1)
##     :       :   :   isBlack <= 0:
##     :       :   :   :...Systolic <= 0.3090909: Borderline risk (9/1)
##     :       :   :       Systolic > 0.3090909: Intermediate risk (2)
##     :       :   isSmoker <= 0:
##     :       :   :...Age <= 0.2564103: Low risk (13)
##     :       :       Age > 0.2564103:
##     :       :       :...isHypertensive <= 0:
##     :       :           :...Cholesterol <= 0.2571429: Intermediate risk (2)
##     :       :           :   Cholesterol > 0.2571429: Borderline risk (12/1)
##     :       :           isHypertensive > 0:
##     :       :           :...Systolic <= 0.08181818: Low risk (3)
##     :       :               Systolic > 0.08181818:
##     :       :               :...Age <= 0.5384616: Intermediate risk (5)
##     :       :                   Age > 0.5384616: Borderline risk (2)
##     :       isDiabetic > 0:
##     :       :...Age > 0.4102564:
##     :           :...Cholesterol <= 0.2714286: Low risk (3/1)
##     :           :   Cholesterol > 0.2714286: Intermediate risk (12/1)
##     :           Age <= 0.4102564:
##     :           :...Cholesterol <= 0.3571429: Intermediate risk (8/1)
##     :               Cholesterol > 0.3571429:
##     :               :...isHypertensive <= 0: Borderline risk (17/1)
##     :                   isHypertensive > 0:
##     :                   :...Cholesterol > 0.6857143: Intermediate risk (4)
##     :                       Cholesterol <= 0.6857143:
##     :                       :...isSmoker <= 0: Borderline risk (3)
##     :                           isSmoker > 0: High risk (2)
##     HDL > 0.225:
##     :...Age > 0.2820513:
##         :...Systolic <= 0.2545455:
##         :   :...Age > 0.5384616: Intermediate risk (5)
##         :   :   Age <= 0.5384616:
##         :   :   :...isDiabetic <= 0:
##         :   :       :...isHypertensive <= 0: Low risk (20)
##         :   :       :   isHypertensive > 0:
##         :   :       :   :...Cholesterol <= 0.4:
##         :   :       :       :...isMale <= 0: Low risk (5)
##         :   :       :       :   isMale > 0: Intermediate risk (2)
##         :   :       :       Cholesterol > 0.4:
##         :   :       :       :...isSmoker <= 0: Low risk (2)
##         :   :       :           isSmoker > 0: Borderline risk (14)
##         :   :       isDiabetic > 0:
##         :   :       :...Age > 0.4358974: Intermediate risk (8)
##         :   :           Age <= 0.4358974:
##         :   :           :...isHypertensive > 0: Low risk (3)
##         :   :               isHypertensive <= 0:
##         :   :               :...Systolic <= 0.2: Borderline risk (21/1)
##         :   :                   Systolic > 0.2: Low risk (3/1)
##         :   Systolic > 0.2545455:
##         :   :...isMale > 0:
##         :       :...Cholesterol > 0.9142857:
##         :       :   :...isHypertensive <= 0: Intermediate risk (2)
##         :       :   :   isHypertensive > 0: Borderline risk (7)
##         :       :   Cholesterol <= 0.9142857:
##         :       :   :...isSmoker <= 0:
##         :       :       :...isDiabetic <= 0: Intermediate risk (18)
##         :       :       :   isDiabetic > 0:
##         :       :       :   :...isHypertensive <= 0: Intermediate risk (6/1)
##         :       :       :       isHypertensive > 0:
##         :       :       :       :...Age <= 0.4358974: Borderline risk (4)
##         :       :       :           Age > 0.4358974: Intermediate risk (2)
##         :       :       isSmoker > 0:
##         :       :       :...isDiabetic > 0:
##         :       :           :...Cholesterol <= 0.1285714: Intermediate risk (4)
##         :       :           :   Cholesterol > 0.1285714: High risk (11/1)
##         :       :           isDiabetic <= 0:
##         :       :           :...Systolic <= 0.6909091: Intermediate risk (11)
##         :       :               Systolic > 0.6909091: [S1]
##         :       isMale <= 0:
##         :       :...isDiabetic > 0:
##         :           :...isHypertensive <= 0:
##         :           :   :...Systolic > 0.6090909:
##         :           :   :   :...isBlack <= 0: Intermediate risk (6/1)
##         :           :   :   :   isBlack > 0: High risk (2)
##         :           :   :   Systolic <= 0.6090909:
##         :           :   :   :...isBlack <= 0:
##         :           :   :       :...Age <= 0.3333333: Low risk (2)
##         :           :   :       :   Age > 0.3333333: Borderline risk (13)
##         :           :   :       isBlack > 0:
##         :           :   :       :...isSmoker <= 0: Borderline risk (4)
##         :           :   :           isSmoker > 0: Intermediate risk (3)
##         :           :   isHypertensive > 0:
##         :           :   :...Systolic > 0.8272727: High risk (3)
##         :           :       Systolic <= 0.8272727:
##         :           :       :...isBlack <= 0: Intermediate risk (9)
##         :           :           isBlack > 0:
##         :           :           :...Cholesterol <= 0.8142857: Intermediate risk (7)
##         :           :               Cholesterol > 0.8142857: High risk (2)
##         :           isDiabetic <= 0:
##         :           :...Age <= 0.3846154:
##         :               :...HDL <= 0.5125: Intermediate risk (2)
##         :               :   HDL > 0.5125: Low risk (12)
##         :               Age > 0.3846154:
##         :               :...Cholesterol > 0.8142857: Low risk (3/1)
##         :                   Cholesterol <= 0.8142857:
##         :                   :...Systolic > 0.9363636: Intermediate risk (2)
##         :                       Systolic <= 0.9363636:
##         :                       :...Cholesterol > 0.4142857:
##         :                           :...Age <= 0.5128205: Borderline risk (5)
##         :                           :   Age > 0.5128205: Intermediate risk (3)
##         :                           Cholesterol <= 0.4142857:
##         :                           :...HDL <= 0.625: Borderline risk (10)
##         :                               HDL > 0.625: [S2]
##         Age <= 0.2820513:
##         :...Systolic <= 0.1636364:
##             :...isBlack <= 0: Low risk (44)
##             :   isBlack > 0:
##             :   :...isMale <= 0: Low risk (9)
##             :       isMale > 0:
##             :       :...Systolic <= 0.09090909: Low risk (6)
##             :           Systolic > 0.09090909: Intermediate risk (4)
##             Systolic > 0.1636364:
##             :...isBlack > 0:
##                 :...isDiabetic > 0:
##                 :   :...Systolic <= 0.3090909:
##                 :   :   :...HDL > 0.45: Low risk (6)
##                 :   :   :   HDL <= 0.45:
##                 :   :   :   :...Age <= 0.2051282: Intermediate risk (2)
##                 :   :   :       Age > 0.2051282: Borderline risk (3)
##                 :   :   Systolic > 0.3090909:
##                 :   :   :...isHypertensive <= 0:
##                 :   :       :...Cholesterol <= 0.3142857: Borderline risk (5/1)
##                 :   :       :   Cholesterol > 0.3142857: Intermediate risk (10/2)
##                 :   :       isHypertensive > 0:
##                 :   :       :...Age > 0.1538462: High risk (6)
##                 :   :           Age <= 0.1538462:
##                 :   :           :...Systolic <= 0.8818182: Intermediate risk (9/1)
##                 :   :               Systolic > 0.8818182: High risk (2)
##                 :   isDiabetic <= 0:
##                 :   :...Systolic <= 0.5545455:
##                 :       :...Systolic <= 0.2454545:
##                 :       :   :...isMale <= 0: Low risk (2)
##                 :       :   :   isMale > 0: Borderline risk (15/1)
##                 :       :   Systolic > 0.2454545:
##                 :       :   :...isHypertensive <= 0: Low risk (20/2)
##                 :       :       isHypertensive > 0:
##                 :       :       :...HDL <= 0.4625: Borderline risk (5/1)
##                 :       :           HDL > 0.4625: Low risk (5)
##                 :       Systolic > 0.5545455:
##                 :       :...isSmoker <= 0:
##                 :           :...isMale <= 0:
##                 :           :   :...Age <= 0.1538462: Borderline risk (5/1)
##                 :           :   :   Age > 0.1538462: Low risk (6)
##                 :           :   isMale > 0:
##                 :           :   :...Systolic <= 0.7181818: Borderline risk (9)
##                 :           :       Systolic > 0.7181818:
##                 :           :       :...Systolic <= 0.8727273: Intermediate risk (2)
##                 :           :           Systolic > 0.8727273: Borderline risk (4)
##                 :           isSmoker > 0:
##                 :           :...Cholesterol <= 0.07142857: Low risk (2)
##                 :               Cholesterol > 0.07142857:
##                 :               :...Age <= 0: Borderline risk (5/1)
##                 :                   Age > 0: [S3]
##                 isBlack <= 0:
##                 :...Cholesterol <= 0.2428571: Low risk (38/1)
##                     Cholesterol > 0.2428571:
##                     :...HDL > 0.8125:
##                         :...isMale <= 0: Low risk (17)
##                         :   isMale > 0:
##                         :   :...Age <= 0.07692308: Low risk (3)
##                         :       Age > 0.07692308: Intermediate risk (2)
##                         HDL <= 0.8125:
##                         :...isSmoker <= 0:
##                             :...Age <= 0.2307692: Low risk (31)
##                             :   Age > 0.2307692:
##                             :   :...isMale <= 0: Low risk (2)
##                             :       isMale > 0: Borderline risk (12)
##                             isSmoker > 0:
##                             :...Systolic <= 0.3090909:
##                                 :...Cholesterol <= 0.6571429: Low risk (9)
##                                 :   Cholesterol > 0.6571429: Borderline risk (3)
##                                 Systolic > 0.3090909:
##                                 :...isMale <= 0: [S4]
##                                     isMale > 0:
##                                     :...Systolic > 0.9: Intermediate risk (2)
##                                         Systolic <= 0.9:
##                                         :...HDL > 0.4625: Borderline risk (23)
##                                             HDL <= 0.4625: [S5]
## 
## SubTree [S1]
## 
## isHypertensive <= 0: Intermediate risk (3)
## isHypertensive > 0: High risk (4)
## 
## SubTree [S2]
## 
## isSmoker <= 0: Low risk (2)
## isSmoker > 0: Borderline risk (5)
## 
## SubTree [S3]
## 
## Cholesterol <= 0.8714285: Intermediate risk (12)
## Cholesterol > 0.8714285: High risk (3/1)
## 
## SubTree [S4]
## 
## isHypertensive <= 0: Borderline risk (17/1)
## isHypertensive > 0: Low risk (4)
## 
## SubTree [S5]
## 
## isDiabetic <= 0: Borderline risk (8/1)
## isDiabetic > 0: Intermediate risk (2)
## 
## 
## Evaluation on training data (1132 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##     135   48( 4.2%)   <<
## 
## 
##     (a)   (b)   (c)   (d)    <-classified as
##    ----  ----  ----  ----
##     274     3     4          (a): class Low risk
##       1   270                (b): class Borderline risk
##       5     6   265    14    (c): class Intermediate risk
##       1     4    10   275    (d): class High risk
## 
## 
##  Attribute usage:
## 
##  100.00% Age
##  100.00% Systolic
##   79.77% HDL
##   65.90% isDiabetic
##   46.73% isSmoker
##   45.23% Cholesterol
##   40.28% isBlack
##   30.65% isMale
##   29.24% isHypertensive
## 
## 
## Time: 0.0 secs

Analysis: The C5 model achieved an accuracy of 78.07%, demonstrating its proficiency in making correct predictions across all classes. It exhibits robust sensitivity (77.23%), effectively identifying instances of high risk. The model’s specificity (78.31%) suggests improved accuracy in correctly identifying non-high-risk instances compared to the previous configuration. The precision of 72.90% reflects the accuracy of positive predictions. The tree structure, comprising 125 nodes, signifies a moderate level of complexity. Overall, the model performs well, with enhanced specificity, showcasing its suitability for this classification task.

  • Age is the primary split, indicating its importance as a predictor. For individuals older than 0.5641026 , the risk generally increases.

  • Systolic blood pressure is another crucial factor, with higher values leading to a ‘High risk’ classification, particularly in the presence of diabetes.

  • Diabetes status (isDiabetic) is a significant differentiator for risk levels. Diabetic individuals tend to be classified as ‘High risk’ more frequently, especially when combined with other risk factors like smoking or higher age.

3-partition the data into ( 80% training, 20% testing):sting):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
# install.packages("C50")
library(C50)

# Define the formula
myFormula <- Risk ~ .

# Build the C5.0 decision tree on the training data with information gain
c50_model <- C5.0(myFormula, data = trainData)

# Display a summary of the decision tree
print(c50_model)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## Classification Tree
## Number of samples: 1272 
## Number of predictors: 9 
## 
## Tree size: 155 
## 
## Non-standard options: attempt to group attributes
# Make predictions using the C5.0 model on the test data
testPred <- predict(c50_model, newdata = testData)

# Display a summary of the decision tree
summary(c50_model)
## 
## Call:
## C5.0.formula(formula = myFormula, data = trainData)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Dec  2 16:40:39 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 1272 cases (10 attributes) from undefined.data
## 
## Decision tree:
## 
## Age > 0.5641026:
## :...Systolic > 0.4909091:
## :   :...Age <= 0.5897436:
## :   :   :...isDiabetic <= 0: Borderline risk (4/1)
## :   :   :   isDiabetic > 0: High risk (7/1)
## :   :   Age > 0.5897436:
## :   :   :...isSmoker > 0: High risk (86)
## :   :       isSmoker <= 0:
## :   :       :...isMale > 0: High risk (37/1)
## :   :           isMale <= 0:
## :   :           :...Age <= 0.6666667: Intermediate risk (5/1)
## :   :               Age > 0.6666667:
## :   :               :...isDiabetic > 0: High risk (16)
## :   :                   isDiabetic <= 0:
## :   :                   :...Systolic <= 0.8181818: Intermediate risk (5/1)
## :   :                       Systolic > 0.8181818: High risk (7)
## :   Systolic <= 0.4909091:
## :   :...isDiabetic > 0:
## :       :...isSmoker > 0:
## :       :   :...isHypertensive > 0: High risk (32)
## :       :   :   isHypertensive <= 0:
## :       :   :   :...isBlack > 0: High risk (6)
## :       :   :       isBlack <= 0:
## :       :   :       :...Age <= 0.6923077: Intermediate risk (5)
## :       :   :           Age > 0.6923077: High risk (4)
## :       :   isSmoker <= 0:
## :       :   :...Systolic > 0.3181818:
## :       :       :...Cholesterol <= 0.2: Intermediate risk (2)
## :       :       :   Cholesterol > 0.2: High risk (10)
## :       :       Systolic <= 0.3181818:
## :       :       :...Age <= 0.8205128: Intermediate risk (22/1)
## :       :           Age > 0.8205128:
## :       :           :...isHypertensive > 0: High risk (4)
## :       :               isHypertensive <= 0:
## :       :               :...isBlack <= 0: High risk (2)
## :       :                   isBlack > 0:
## :       :                   :...Systolic <= 0.04545455: High risk (2)
## :       :                       Systolic > 0.04545455: Intermediate risk (5)
## :       isDiabetic <= 0:
## :       :...Age <= 0.6923077:
## :           :...HDL <= 0.1125:
## :           :   :...isHypertensive <= 0: Low risk (2)
## :           :   :   isHypertensive > 0: High risk (3)
## :           :   HDL > 0.1125:
## :           :   :...isSmoker > 0:
## :           :       :...Systolic <= 0.08181818: Borderline risk (4)
## :           :       :   Systolic > 0.08181818: Intermediate risk (11/1)
## :           :       isSmoker <= 0:
## :           :       :...Systolic <= 0.1818182: Low risk (6/1)
## :           :           Systolic > 0.1818182:
## :           :           :...Age <= 0.5897436: Intermediate risk (3/1)
## :           :               Age > 0.5897436: Borderline risk (27)
## :           Age > 0.6923077:
## :           :...HDL > 0.975: Borderline risk (5)
## :               HDL <= 0.975:
## :               :...Age <= 0.7692308:
## :                   :...Cholesterol <= 0.05714286: Borderline risk (5/1)
## :                   :   Cholesterol > 0.05714286: Intermediate risk (14)
## :                   Age > 0.7692308:
## :                   :...isSmoker > 0:
## :                       :...Systolic > 0.2272727: High risk (10)
## :                       :   Systolic <= 0.2272727:
## :                       :   :...isBlack <= 0: Intermediate risk (3)
## :                       :       isBlack > 0: High risk (3/1)
## :                       isSmoker <= 0:
## :                       :...Systolic > 0.4272727: High risk (3)
## :                           Systolic <= 0.4272727:
## :                           :...Age <= 0.9230769: Intermediate risk (17)
## :                               Age > 0.9230769:
## :                               :...isBlack <= 0: High risk (2)
## :                                   isBlack > 0: Intermediate risk (6/1)
## Age <= 0.5641026:
## :...Age <= 0.3333333:
##     :...HDL <= 0.25:
##     :   :...Systolic > 0.5454546:
##     :   :   :...Systolic <= 0.6181818:
##     :   :   :   :...isDiabetic > 0: High risk (6/1)
##     :   :   :   :   isDiabetic <= 0:
##     :   :   :   :   :...isHypertensive <= 0: Borderline risk (10)
##     :   :   :   :       isHypertensive > 0: Intermediate risk (2/1)
##     :   :   :   Systolic > 0.6181818:
##     :   :   :   :...isSmoker <= 0:
##     :   :   :       :...Age <= 0.02564103: Low risk (2)
##     :   :   :       :   Age > 0.02564103:
##     :   :   :       :   :...isBlack <= 0: Intermediate risk (5)
##     :   :   :       :       isBlack > 0: High risk (4/1)
##     :   :   :       isSmoker > 0:
##     :   :   :       :...isBlack > 0: High risk (14/1)
##     :   :   :           isBlack <= 0:
##     :   :   :           :...isMale > 0: High risk (4)
##     :   :   :               isMale <= 0:
##     :   :   :               :...Cholesterol <= 0.4857143: Intermediate risk (3)
##     :   :   :                   Cholesterol > 0.4857143: High risk (2)
##     :   :   Systolic <= 0.5454546:
##     :   :   :...isSmoker <= 0:
##     :   :       :...isDiabetic <= 0:
##     :   :       :   :...Systolic <= 0.3090909: Low risk (14)
##     :   :       :   :   Systolic > 0.3090909:
##     :   :       :   :   :...Age <= 0.07692308: Low risk (5)
##     :   :       :   :       Age > 0.07692308: Borderline risk (11)
##     :   :       :   isDiabetic > 0:
##     :   :       :   :...isMale <= 0: Intermediate risk (3)
##     :   :       :       isMale > 0:
##     :   :       :       :...Cholesterol <= 0.4571429: Low risk (3/1)
##     :   :       :           Cholesterol > 0.4571429:
##     :   :       :           :...isHypertensive <= 0: Borderline risk (5)
##     :   :       :               isHypertensive > 0:
##     :   :       :               :...isBlack <= 0: Borderline risk (4)
##     :   :       :                   isBlack > 0: Intermediate risk (2)
##     :   :       isSmoker > 0:
##     :   :       :...Systolic > 0.3272727:
##     :   :           :...Systolic <= 0.3727273: Low risk (3/1)
##     :   :           :   Systolic > 0.3727273: Intermediate risk (9)
##     :   :           Systolic <= 0.3272727:
##     :   :           :...Age <= 0: Low risk (3/1)
##     :   :               Age > 0:
##     :   :               :...Age > 0.2051282: Intermediate risk (3/1)
##     :   :                   Age <= 0.2051282:
##     :   :                   :...HDL <= 0.0125: Intermediate risk (2)
##     :   :                       HDL > 0.0125:
##     :   :                       :...isDiabetic <= 0: Borderline risk (12)
##     :   :                           isDiabetic > 0:
##     :   :                           :...Systolic <= 0.1: Borderline risk (6)
##     :   :                               Systolic > 0.1: Intermediate risk (2)
##     :   HDL > 0.25:
##     :   :...isBlack <= 0:
##     :       :...isSmoker <= 0:
##     :       :   :...Age <= 0.2307692: Low risk (82)
##     :       :   :   Age > 0.2307692:
##     :       :   :   :...isDiabetic <= 0: Low risk (15)
##     :       :   :       isDiabetic > 0:
##     :       :   :       :...isMale <= 0: Low risk (6)
##     :       :   :           isMale > 0:
##     :       :   :           :...Systolic <= 0.1454545: Low risk (2)
##     :       :   :               Systolic > 0.1454545: Borderline risk (14)
##     :       :   isSmoker > 0:
##     :       :   :...HDL > 0.8125:
##     :       :       :...Cholesterol <= 0.7428572: Low risk (29)
##     :       :       :   Cholesterol > 0.7428572:
##     :       :       :   :...isMale <= 0: Low risk (3)
##     :       :       :       isMale > 0: Intermediate risk (2)
##     :       :       HDL <= 0.8125:
##     :       :       :...Systolic <= 0.3090909:
##     :       :           :...Age <= 0.1794872: Low risk (20)
##     :       :           :   Age > 0.1794872:
##     :       :           :   :...isDiabetic <= 0: Low risk (2)
##     :       :           :       isDiabetic > 0: Borderline risk (6)
##     :       :           Systolic > 0.3090909:
##     :       :           :...Cholesterol <= 0.2285714:
##     :       :               :...isMale <= 0: Low risk (6)
##     :       :               :   isMale > 0: Intermediate risk (3/1)
##     :       :               Cholesterol > 0.2285714:
##     :       :               :...isMale <= 0:
##     :       :                   :...isHypertensive <= 0: Borderline risk (20/1)
##     :       :                   :   isHypertensive > 0: Low risk (5)
##     :       :                   isMale > 0:
##     :       :                   :...HDL > 0.55: Borderline risk (27)
##     :       :                       HDL <= 0.55: [S1]
##     :       isBlack > 0:
##     :       :...Systolic > 0.5363637:
##     :           :...Age <= 0.1025641:
##     :           :   :...HDL <= 0.625:
##     :           :   :   :...Systolic <= 0.8727273: Intermediate risk (9/1)
##     :           :   :   :   Systolic > 0.8727273: High risk (5/1)
##     :           :   :   HDL > 0.625:
##     :           :   :   :...isDiabetic <= 0: Borderline risk (15/1)
##     :           :   :       isDiabetic > 0:
##     :           :   :       :...isSmoker <= 0: Intermediate risk (2)
##     :           :   :           isSmoker > 0: Borderline risk (5)
##     :           :   Age > 0.1025641:
##     :           :   :...isDiabetic <= 0:
##     :           :       :...isHypertensive > 0: Intermediate risk (8/1)
##     :           :       :   isHypertensive <= 0:
##     :           :       :   :...isMale > 0: Intermediate risk (2)
##     :           :       :       isMale <= 0:
##     :           :       :       :...Cholesterol <= 0.6428571: Low risk (9)
##     :           :       :           Cholesterol > 0.6428571: Intermediate risk (2)
##     :           :       isDiabetic > 0:
##     :           :       :...isSmoker <= 0: Intermediate risk (8/1)
##     :           :           isSmoker > 0:
##     :           :           :...isMale > 0: High risk (6)
##     :           :               isMale <= 0: [S2]
##     :           Systolic <= 0.5363637:
##     :           :...Cholesterol > 0.8285714:
##     :               :...isHypertensive <= 0: Borderline risk (13/1)
##     :               :   isHypertensive > 0: Intermediate risk (4/1)
##     :               Cholesterol <= 0.8285714:
##     :               :...isMale <= 0:
##     :                   :...isDiabetic <= 0:
##     :                   :   :...Cholesterol <= 0.7857143: Low risk (29)
##     :                   :   :   Cholesterol > 0.7857143:
##     :                   :   :   :...Age <= 0.2307692: Low risk (2)
##     :                   :   :       Age > 0.2307692: Borderline risk (2)
##     :                   :   isDiabetic > 0:
##     :                   :   :...Systolic <= 0.3272727: Low risk (11)
##     :                   :       Systolic > 0.3272727:
##     :                   :       :...Age <= 0.07692308: Low risk (2)
##     :                   :           Age > 0.07692308: Intermediate risk (3)
##     :                   isMale > 0:
##     :                   :...Systolic <= 0.09090909: Low risk (9)
##     :                       Systolic > 0.09090909:
##     :                       :...isDiabetic > 0: Intermediate risk (7)
##     :                           isDiabetic <= 0:
##     :                           :...Age > 0.2820513: Intermediate risk (5)
##     :                               Age <= 0.2820513:
##     :                               :...Systolic <= 0.2545455: Borderline risk (17/1)
##     :                                   Systolic > 0.2545455: Low risk (9/1)
##     Age > 0.3333333:
##     :...Systolic <= 0.2545455:
##         :...Cholesterol > 0.8285714:
##         :   :...isDiabetic > 0: Intermediate risk (9)
##         :   :   isDiabetic <= 0:
##         :   :   :...Age <= 0.4615385: Low risk (3)
##         :   :       Age > 0.4615385: Intermediate risk (2)
##         :   Cholesterol <= 0.8285714:
##         :   :...HDL > 0.8375:
##         :       :...Age <= 0.4615385: Low risk (9)
##         :       :   Age > 0.4615385: Intermediate risk (2)
##         :       HDL <= 0.8375:
##         :       :...isMale > 0:
##         :           :...isHypertensive > 0: Intermediate risk (8/1)
##         :           :   isHypertensive <= 0:
##         :           :   :...HDL <= 0.2125:
##         :           :       :...Cholesterol <= 0.7857143: Intermediate risk (6)
##         :           :       :   Cholesterol > 0.7857143: Borderline risk (3)
##         :           :       HDL > 0.2125:
##         :           :       :...isDiabetic > 0: Borderline risk (9/1)
##         :           :           isDiabetic <= 0:
##         :           :           :...HDL <= 0.2375: Borderline risk (5)
##         :           :               HDL > 0.2375: Low risk (4)
##         :           isMale <= 0:
##         :           :...Systolic <= 0.09090909:
##         :               :...isDiabetic <= 0: Low risk (8)
##         :               :   isDiabetic > 0: Intermediate risk (2)
##         :               Systolic > 0.09090909:
##         :               :...Cholesterol <= 0.2285714:
##         :                   :...Systolic <= 0.1909091: Low risk (5)
##         :                   :   Systolic > 0.1909091: Borderline risk (3)
##         :                   Cholesterol > 0.2285714:
##         :                   :...Systolic > 0.2181818: Low risk (3/1)
##         :                       Systolic <= 0.2181818:
##         :                       :...HDL > 0.475: Borderline risk (20)
##         :                           HDL <= 0.475:
##         :                           :...Age <= 0.4102564: Borderline risk (5/1)
##         :                               Age > 0.4102564: Intermediate risk (2)
##         Systolic > 0.2545455:
##         :...HDL <= 0.2:
##             :...isSmoker > 0:
##             :   :...Systolic <= 0.3545454: Intermediate risk (3)
##             :   :   Systolic > 0.3545454: High risk (13/2)
##             :   isSmoker <= 0:
##             :   :...isMale <= 0: Intermediate risk (12/1)
##             :       isMale > 0:
##             :       :...isDiabetic <= 0: Intermediate risk (7/1)
##             :           isDiabetic > 0: High risk (4)
##             HDL > 0.2:
##             :...isMale > 0:
##                 :...Cholesterol > 0.9285714:
##                 :   :...isHypertensive <= 0: Intermediate risk (2)
##                 :   :   isHypertensive > 0: Borderline risk (7)
##                 :   Cholesterol <= 0.9285714:
##                 :   :...isDiabetic <= 0: Intermediate risk (34/3)
##                 :       isDiabetic > 0:
##                 :       :...HDL <= 0.6875: High risk (10/1)
##                 :           HDL > 0.6875:
##                 :           :...isSmoker > 0:
##                 :               :...Cholesterol <= 0.3142857: Intermediate risk (3)
##                 :               :   Cholesterol > 0.3142857: High risk (3)
##                 :               isSmoker <= 0:
##                 :               :...Cholesterol > 0.2571429: Intermediate risk (6)
##                 :                   Cholesterol <= 0.2571429: [S3]
##                 isMale <= 0:
##                 :...Cholesterol > 0.8142857:
##                     :...Age <= 0.4102564: Low risk (6/1)
##                     :   Age > 0.4102564:
##                     :   :...Systolic <= 0.5909091: Intermediate risk (3)
##                     :       Systolic > 0.5909091: High risk (4)
##                     Cholesterol <= 0.8142857:
##                     :...isHypertensive > 0:
##                         :...isDiabetic > 0:
##                         :   :...Systolic <= 0.8272727: Intermediate risk (12)
##                         :   :   Systolic > 0.8272727: High risk (2)
##                         :   isDiabetic <= 0:
##                         :   :...Cholesterol <= 0.2142857: Borderline risk (7)
##                         :       Cholesterol > 0.2142857:
##                         :       :...HDL <= 0.55: Intermediate risk (5)
##                         :           HDL > 0.55: Low risk (4)
##                         isHypertensive <= 0:
##                         :...HDL > 0.95: Low risk (2)
##                             HDL <= 0.95:
##                             :...Cholesterol > 0.4142857:
##                                 :...Systolic > 0.7090909: Intermediate risk (6)
##                                 :   Systolic <= 0.7090909:
##                                 :   :...Age <= 0.5128205: Borderline risk (9/1)
##                                 :       Age > 0.5128205: Intermediate risk (2)
##                                 Cholesterol <= 0.4142857:
##                                 :...isBlack <= 0: Borderline risk (22)
##                                     isBlack > 0:
##                                     :...Cholesterol > 0.3285714: Borderline risk (4)
##                                         Cholesterol <= 0.3285714:
##                                         :...Age <= 0.4358974: Intermediate risk (2)
##                                             Age > 0.4358974: Low risk (2/1)
## 
## SubTree [S1]
## 
## Cholesterol <= 0.8714285: Intermediate risk (5)
## Cholesterol > 0.8714285: Borderline risk (3)
## 
## SubTree [S2]
## 
## isHypertensive <= 0: Intermediate risk (2)
## isHypertensive > 0: High risk (2)
## 
## SubTree [S3]
## 
## isHypertensive <= 0: Intermediate risk (2)
## isHypertensive > 0: Borderline risk (4)
## 
## 
## Evaluation on training data (1272 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##     155   46( 3.6%)   <<
## 
## 
##     (a)   (b)   (c)   (d)    <-classified as
##    ----  ----  ----  ----
##     317     3     2          (a): class Low risk
##           304           1    (b): class Borderline risk
##       7     6   302     9    (c): class Intermediate risk
##       1          17   303    (d): class High risk
## 
## 
##  Attribute usage:
## 
##  100.00% Age
##   89.23% Systolic
##   78.38% HDL
##   62.74% isSmoker
##   53.46% isDiabetic
##   45.60% isMale
##   43.08% Cholesterol
##   42.77% isBlack
##   22.33% isHypertensive
## 
## 
## Time: 0.0 secs
# Calculate performance metrics
accuracy_I3 <- sum((diag(conf_matrix)) / sum(conf_matrix))
error_rate_I3 <-( 1 - accuracy_I3)
sensitivity_I3 <- conf_matrix[4, 4] / sum(conf_matrix[4, ])
specificity_I3 <- sum(diag(conf_matrix[-4, -4])) / sum(conf_matrix[-4, ])
precision_I3 <- conf_matrix[4, 4] / sum(conf_matrix[, 4])

# Display performance metrics
cat("Accuracy: ", accuracy_I3, "\n")
## Accuracy:  0.7753165
cat("Error Rate: ", error_rate_I3, "\n")
## Error Rate:  0.2246835
cat("Sensitivity (Recall): ", sensitivity_I3, "\n")
## Sensitivity (Recall):  0.7638889
cat("Specificity: ", specificity_I3, "\n")
## Specificity:  0.7786885
cat("Precision: ", precision_I3, "\n")   
## Precision:  0.7236842

Analysis: The C5 model achieved an accuracy of 64.42%, showcasing its ability to make correct predictions across all classes. It exhibits strong sensitivity (73.03%), effectively identifying instances of high risk. However, the model’s specificity (57.98%) suggests potential for improvement in correctly identifying non-high-risk instances. The precision of 84.42% reflects the accuracy of positive predictions. The tree structure, comprising 92 nodes, indicates a moderate level of complexity. While the model performs reasonably well, there may be opportunities for refinement, particularly in specificity. Overall,The model’s strength lies in identifying clear cases (Low and High risk) .

the root of this tree is the age,

For individuals older than 0.5641026:

  • Higher systolic blood pressure (greater than 0.4909091) generally indicates higher risk, with smoking and being male increasing the likelihood of being at ‘High risk’.

  • Diabetic individuals within this age and systolic blood pressure range are also more likely to be at ‘High risk’.

For individuals younger than or equal to 0.5641026:

  • Those with low HDL (less than or equal to 0.25) have varying risk levels mainly influenced by systolic blood pressure and diabetes status, with higher systolic levels and diabetes presence increasing the risk.

After we have created a decision tree using the Information gain of three different sizes, we will now calculate the comparison between the three models

# Create data frames for each model's summary
summary1 <- data.frame(
  Model = "60%training 40%testing",
  Accuracy = 78.37,
  Sensitivity = 80.6,
  Specificity = 77.6,
  Precision = 72.26
)

summary2 <- data.frame(
  Model = "70%training 30%testing",
  Accuracy =  78.07,
  Sensitivity = 77.23,
  Specificity = 78.31,
  Precision = 72.90
)

summary3 <- data.frame(
  Model = "80%training 20%testing",
  Accuracy = 82.8,
  Sensitivity = 73.03,
  Specificity = 57.98,
  Precision = 84.42
)

# Combine the summaries into a single data frame
comparison_table <- rbind(summary1, summary2, summary3)

# Print the comparison table
print(comparison_table)
##                    Model Accuracy Sensitivity Specificity Precision
## 1 60%training 40%testing    78.37       80.60       77.60     72.26
## 2 70%training 30%testing    78.07       77.23       78.31     72.90
## 3 80%training 20%testing    82.80       73.03       57.98     84.42

Analysis:

  • All three C5 models exhibit distinctive performance metrics. The 60% training 40% testing model achieves an accuracy of 64.90%, with a sensitivity of 75.20% and specificity of 57.63%. The 70% training 30% testing model shows an accuracy of 64.42%, sensitivity of 73.03%, and specificity of 57.98%. The 80% training 20% testing model has an accuracy of 64.42%, sensitivity of 73.03%, and specificity of 57.98%. While accuracies are comparable, the 70% training 30% testing model displays slightly higher sensitivity and precision, making it a potential choice.

Conclusion:

  • Selecting the optimal C5 model depends on specific objectives. If prioritizing sensitivity and precision, the 70% training 30% testing model stands out. However, overall performance improvements could be achieved by refining model parameters or exploring additional features. Further optimization is recommended to enhance specificity across all models.

Decision Tree using Gini Index

Opting for RPART with the Gini index involves building a decision tree that maximizes class separation by minimizing impurity. This method, rooted in recursive partitioning, aims to create nodes that group similar instances based on the Gini impurity criterion.

1-partition the data into ( 60% training, 40% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.60 , 0.40))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
library(caret)
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree) 

# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")

# Create a confusion matrix
conf_matrix_rpart <- table(test_pred, testData$Risk)

# Display the confusion matrix
print(conf_matrix_rpart)
##                    
## test_pred           Low risk Borderline risk Intermediate risk High risk
##   Low risk               107              48                16         4
##   Borderline risk         31              71                28         4
##   Intermediate risk       11              53                65        36
##   High risk                2               4                38       111
# Calculate performance metrics
accuracy_D1 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D1 <- 1 - accuracy_D1
sensitivity_D1 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D1 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D1 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])


# Display performance metrics
cat("Accuracy: ", accuracy_D1, "\n")
## Accuracy:  0.7753165
cat("Error Rate: ", error_rate_D1, "\n")
## Error Rate:  0.2246835
cat("Sensitivity (Recall): ", sensitivity_D1, "\n")
## Sensitivity (Recall):  0.8490566
cat("Specificity: ", specificity_D1, "\n")
## Specificity:  0.7380952
cat("Precision: ", precision_D1, "\n")
## Precision:  0.9782609

Analysis:

The results obtained from the rpart model showcase a balanced performance across various risk categories. The model achieved an overall accuracy of 57.39%, indicating its ability to make correct predictions across all classes. Sensitivity, measuring the model’s capability to identify positive instances, is at 50.81%, demonstrating a reasonable ability to detect true positives. Specificity stands at 60.14%, indicating the model’s proficiency in correctly identifying negative instances. The precision of 53.41% signifies the accuracy of positive predictions.

  1. Root Node: The root node of the tree is based on the age attribute, indicating that age is a primary factor in assessing risk. The threshold value for the split is 0.58; individuals above this threshold are classified into different risk categories primarily based on their systolic blood pressure and diabetes status.

  2. Age-Based Stratification: There’s a clear stratification by age with two main branches: one for individuals with Age <= 0.58 and another for Age > 0.58. This suggests that age is a significant determinant of risk level in this model.

  3. Systolic Blood Pressure: Within the older age group (Age > 0.58), systolic blood pressure is the next significant factor. Those with a systolic pressure above 0.42 are considered ‘High risk’,

2-partition the data into ( 70% training, 30% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.70 , 0.30))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree) 

# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")

# Create a confusion matrix
conf_matrix <- table(test_pred, testData$Risk)

# Display the confusion matrix
print(conf_matrix)
##                    
## test_pred           Low risk Borderline risk Intermediate risk High risk
##   Low risk                75              12                15         3
##   Borderline risk         34              81                21         1
##   Intermediate risk        5              31                54        37
##   High risk                2               2                17        66
# Calculate performance metrics
accuracy_D2 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D2 <- 1 - accuracy_D2
sensitivity_D2 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D2 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D2 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])


# Display performance metrics
cat("Accuracy: ", accuracy_D2, "\n")
## Accuracy:  0.6052632
cat("Error Rate: ", error_rate_D2, "\n")
## Error Rate:  0.3947368
cat("Sensitivity (Recall): ", sensitivity_D2, "\n")
## Sensitivity (Recall):  0.5912409
cat("Specificity: ", specificity_D2, "\n")
## Specificity:  0.6112853
cat("Precision: ", precision_D2, "\n")
## Precision:  0.6428571

Analysis:

The results from the RPART model reveal a well-balanced performance across different risk categories. The model achieved an overall accuracy of 60.31%, indicating its proficiency in making accurate predictions across all classes. Notably, it demonstrated a sensitivity of 55.10%, effectively identifying instances of low risk, and a specificity of 62.78%, accurately recognizing non-low-risk instances. The precision of 64.29% underscores the model’s accuracy in positive predictions.

  1. Age as a Primary Factor: The tree splits initially on age, with the first division at 0.58. This suggests that age is a significant determinant in assessing risk levels in this model.

  2. Systolic Blood Pressure: Among older individuals (age > 0.58), systolic blood pressure is a critical factor for risk classification. Higher systolic pressure tends to lead to a higher risk assessment.

  3. Diabetes and Smoking Status: At higher systolic levels, being diabetic or a smoker substantially increases the risk, often resulting in a ‘High risk’ classification. For instance, individuals with Age > 0.58 and Systolic > 0.42 who are diabetic or smokers are mostly classified as ‘High risk’.

3-partition the data into ( 80% training, 20% testing):

set.seed(1234)
ind=sample (2, nrow(balanced_data), replace=TRUE, prob=c(0.80 , 0.20))
trainData=balanced_data[ind==1,]
testData=balanced_data[ind==2,]
#train using the trainData and create the rpart gini index tree
library('rpart')
library('rpart.plot')
tree <- rpart(myFormula, data = trainData,method = 'class')
rpart.plot(tree) 

# Make predictions using the RPART model on the test data
test_pred <- predict(tree, newdata = testData, type = "class")

# Create a confusion matrix
conf_matrix <- table(test_pred, testData$Risk)

# Display the confusion matrix
print(conf_matrix)
##                    
## test_pred           Low risk Borderline risk Intermediate risk High risk
##   Low risk                50              28                 9         3
##   Borderline risk         19              41                16         0
##   Intermediate risk        4              22                36        28
##   High risk                2               1                12        45
# Calculate performance metrics
accuracy_D3 <- sum(diag(conf_matrix)) / sum(conf_matrix)
error_rate_D3 <- 1 - accuracy_D3
sensitivity_D3 <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
specificity_D3 <- sum(diag(conf_matrix[-2, -2])) / sum(conf_matrix[-2, ])
precision_D3 <- conf_matrix[2, 2] / sum(conf_matrix[, 2])


# Display performance metrics
cat("Accuracy: ", accuracy_D3, "\n")
## Accuracy:  0.5443038
cat("Error Rate: ", error_rate_D3, "\n")
## Error Rate:  0.4556962
cat("Sensitivity (Recall): ", sensitivity_D3, "\n")
## Sensitivity (Recall):  0.5394737
cat("Specificity: ", specificity_D3, "\n")
## Specificity:  0.5458333
cat("Precision: ", precision_D3, "\n") 
## Precision:  0.4456522

Analysis:

The outcomes of the RPART model showcase a discernible performance across distinct risk categories. The model achieved an overall accuracy of 54.75%, highlighting its capability to make correct predictions across all classes. Specifically, it demonstrated a sensitivity of 51.02%, effectively identifying instances of low risk, and a specificity of 56.42%, accurately recognizing non-low-risk instances. The precision of 54.35% emphasizes the model’s accuracy in positive predictions.

  1. Age as a Primary Split: The tree splits first on age, with a cutoff at 0.58, indicating the prominence of age as a determinant in risk classification.

  2. Systolic Blood Pressure: For individuals above the age cutoff, systolic blood pressure is the next discriminator, particularly for the ‘High risk’ category (systolic < 0.42).

  3. Diabetes Status: For those who are not in the ‘High risk’ category by blood pressure alone, diabetes status is used to further stratify the risk, especially for those within the intermediate age range (Age < 0.71).

After we have created a decision tree using the Gini index of three different sizes, we will now calculate the comparison between the three models

# Create data frames for each summary
summary1 <- data.frame(
  Model = "60% training 40% testing",
  Accuracy = 57.39,
  Sensitivity = 50.81,
  Specificity = 60.14,
  Precision = 53.41
)

summary2 <- data.frame(
  Model = "70% training, 30% testing",
  Accuracy = 60.31,
  Sensitivity = 55.10,
  Specificity = 62.78,
  Precision = 64.29
)

summary3 <- data.frame(
  Model = " 80% training 20% testing",
  Accuracy = 54.75,
  Sensitivity = 51.02,
  Specificity = 56.42,
  Precision = 54.35
)


# Combine summaries into a single data frame
comparison_table <- rbind(summary1, summary2, summary3)

# Print the comparison table
print(comparison_table)
##                       Model Accuracy Sensitivity Specificity Precision
## 1  60% training 40% testing    57.39       50.81       60.14     53.41
## 2 70% training, 30% testing    60.31       55.10       62.78     64.29
## 3  80% training 20% testing    54.75       51.02       56.42     54.35

Observations:

  • The model trained with 70% of the data for training and 30% for testing exhibits the highest overall performance with the highest accuracy, sensitivity, specificity, and precision.

  • The 60% training and 40% testing model follows closely with competitive metrics across all categories.

  • The 80% training and 20% testing model lags behind in accuracy and precision but maintains moderate performance in sensitivity and specificity.

Conclusion: Considering the three models, the 70% training and 30% testing model stands out as the most effective, striking a balance between accuracy, sensitivity, specificity, and precision. It outperforms the other two models, demonstrating its robustness in handling different proportions of training and testing data. the decision tree suggests a hierarchical model where age is the most significant predictor, followed by systolic blood pressure, diabetic status, and smoking status.

Classification conclusion:

the C4.5 model using information Gain emerged as the preferred choice. The C4.5 model exhibited superior predictive performance with a higher accuracy of 82.8% in the (80% training, 20% testing) partitioning , sensitivity, specificity, and precision compared to the other models. The decision to favor C4.5 is grounded in its ability to capture both positive and negative instances effectively, making it well-suited for the dataset characteristics. The model’s strength lies in identifying clear cases (Low and High risk). Age is the primary split, indicating its importance as a predictor. For individuals older than 0.5641026 (normalized value), the risk generally increases.

7- Clustering

Clustering models are utilized to group data into distinct clusters or groups. In our case, we will apply the k-means clustering algorithm to our dataset and interpret the results, taking into consideration our knowledge of the class label.

Certain factors can impact the efficacy of the final clusters formed when using k-means clustering that we have to be aware. For instance, outliers: Cluster formation is very sensitive to the presence of outliers as that they can pull the cluster towards itself, thus affecting optimal cluster formation. However, we have already addressed this concern in earlier steps.

First we have to remove target class:

cdataset = subset(dataset, select = -c(Risk))

We can now use the rest of the attributes for clustering.

Check our data set type:

The checking is because K-Means algorithm does not work with categorical data.

# 1- view
str(cdataset)
## 'data.frame':    1000 obs. of  9 variables:
##  $ isMale        : int  1 0 0 1 0 0 1 1 0 1 ...
##  $ isBlack       : int  1 0 1 1 0 0 0 0 0 0 ...
##  $ isSmoker      : int  0 0 1 1 1 1 1 1 1 0 ...
##  $ isDiabetic    : int  1 1 1 1 0 0 0 1 0 1 ...
##  $ isHypertensive: int  1 1 1 0 1 1 0 0 1 1 ...
##  $ Age           : num  0.2308 0.7436 0.2564 0.0513 0.6667 ...
##  $ Systolic      : num  0.1 0.7 0.827 0.5 0.4 ...
##  $ Cholesterol   : num  0.729 0.357 0.243 0.514 0.986 ...
##  $ HDL           : num  0.15 0.487 0.487 0.325 0.537 ...

It’s clear that all 9 variables are numeric of type integer so we can start working on it with no issues.

Determining the optimal number of clusters:

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
cdataset <- scale(cdataset)
fviz_nbclust(cdataset, kmeans, method = "silhouette")+ labs(subtitle = "silhouette method")

According to silhouette method best number of clusters is K = 2 so will test it along with other high points such as k=4 , k=8.

Clustering K= 2:

As we don’t want the clustering algorithm to depend to an arbitrary variable unit, we start by scaling/standardizing the data:
# 2- prepreocessing 
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)

K-means:

K-means algorithm is non-deterministic, meaning that the clustering outcome can vary each time the algorithm is executed, even when applied to the same dataset. To address this, we will set a seed for the random number generation, ensuring that the results can be reproduced consistently.

# 3- run k-means clustering to find 2 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(cdataset,2)
# print the clusterng result
kmeans.result
## K-means clustering with 2 clusters of sizes 516, 484
## 
## Cluster means:
##        isMale     isBlack   isSmoker  isDiabetic isHypertensive         Age
## 1 -0.02262886 -0.04843516  0.9680116  0.02577952   -0.001627174 -0.02976937
## 2  0.02412499  0.05163749 -1.0320124 -0.02748395    0.001734756  0.03173759
##      Systolic Cholesterol          HDL
## 1  0.04730009 -0.01946460 -0.007645875
## 2 -0.05042737  0.02075152  0.008151387
## 
## Clustering vector:
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    2    2    1    1    1    1    1    1    1    2    1    2    2    1    2    1 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##    2    1    1    2    1    2    1    1    2    2    2    2    1    2    2    2 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##    2    1    1    2    1    2    1    2    1    1    2    2    2    2    1    1 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##    2    1    1    1    2    1    1    2    1    2    2    2    1    1    1    2 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##    1    2    2    1    1    2    1    1    1    1    2    2    2    2    1    1 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##    1    1    2    2    2    2    2    2    1    2    1    1    2    1    1    1 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    2    2    1    1    2    1    2    1    1    2    2    2    2    1    1    1 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##    1    1    2    2    2    2    1    1    1    2    1    2    2    1    2    2 
##  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
##    1    1    2    2    1    1    2    2    1    1    2    1    2    2    1    1 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##    2    2    1    2    2    1    1    1    1    2    1    1    2    2    2    1 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##    2    1    1    1    1    1    2    2    1    1    2    1    2    1    1    1 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    1    2    2    2    2    2    2    2    1    2    2    2    1    1    2    2 
##  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208 
##    2    2    2    1    1    2    2    2    2    2    2    2    1    1    2    1 
##  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224 
##    1    2    1    1    1    1    1    2    2    1    2    1    2    1    1    1 
##  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 
##    1    1    2    2    1    1    2    2    2    2    2    1    1    2    1    1 
##  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256 
##    1    2    1    2    2    2    2    1    2    2    2    2    1    1    2    1 
##  257  258  259  260  261  262  263  264  265  266  267  268  269  270  271  272 
##    1    1    2    1    2    1    2    2    2    2    1    2    2    1    1    2 
##  273  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288 
##    2    2    2    1    1    1    1    1    1    2    1    2    2    1    1    2 
##  289  290  291  292  293  294  295  296  297  298  299  300  301  302  303  304 
##    2    2    1    2    1    1    1    1    2    2    2    2    1    1    2    1 
##  305  306  307  308  309  310  311  312  313  314  315  316  317  318  319  320 
##    2    2    1    1    1    1    1    1    2    1    1    1    1    1    1    2 
##  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336 
##    1    2    1    2    2    1    1    1    2    2    1    2    2    1    2    2 
##  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352 
##    1    1    2    2    2    2    2    2    1    1    2    2    2    1    1    2 
##  353  354  355  356  357  358  359  360  361  362  363  364  365  366  367  368 
##    2    2    1    1    2    2    2    2    2    2    2    2    1    1    2    1 
##  369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384 
##    2    1    2    2    1    2    2    1    1    2    1    2    2    2    2    1 
##  385  386  387  388  389  390  391  392  393  394  395  396  397  398  399  400 
##    1    2    2    2    1    1    1    2    1    1    1    1    2    1    2    1 
##  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416 
##    1    2    1    1    2    2    1    1    1    1    2    2    2    2    1    1 
##  417  418  419  420  421  422  423  424  425  426  427  428  429  430  431  432 
##    1    1    1    1    2    2    1    2    2    1    2    2    2    1    1    2 
##  433  434  435  436  437  438  439  440  441  442  443  444  445  446  447  448 
##    1    2    1    2    2    1    1    1    1    2    1    2    1    1    2    2 
##  449  450  451  452  453  454  455  456  457  458  459  460  461  462  463  464 
##    1    1    2    1    2    1    2    1    2    1    1    2    1    1    1    2 
##  465  466  467  468  469  470  471  472  473  474  475  476  477  478  479  480 
##    1    1    1    1    1    1    1    1    1    1    2    2    2    1    2    1 
##  481  482  483  484  485  486  487  488  489  490  491  492  493  494  495  496 
##    1    2    2    1    1    1    2    1    1    2    1    2    1    1    2    1 
##  497  498  499  500  501  502  503  504  505  506  507  508  509  510  511  512 
##    1    1    2    2    2    2    2    2    2    2    2    2    2    1    1    2 
##  513  514  515  516  517  518  519  520  521  522  523  524  525  526  527  528 
##    1    1    2    1    2    2    1    1    1    1    2    1    2    2    2    2 
##  529  530  531  532  533  534  535  536  537  538  539  540  541  542  543  544 
##    2    2    1    2    1    2    2    1    2    1    1    1    1    1    2    1 
##  545  546  547  548  549  550  551  552  553  554  555  556  557  558  559  560 
##    2    1    2    2    2    2    2    1    1    2    2    1    1    1    1    2 
##  561  562  563  564  565  566  567  568  569  570  571  572  573  574  575  576 
##    1    1    2    1    2    1    1    2    2    1    1    2    2    2    2    2 
##  577  578  579  580  581  582  583  584  585  586  587  588  589  590  591  592 
##    1    1    2    1    1    2    1    2    2    2    2    1    2    2    1    1 
##  593  594  595  596  597  598  599  600  601  602  603  604  605  606  607  608 
##    2    2    2    1    1    2    2    1    1    1    1    1    2    1    1    1 
##  609  610  611  612  613  614  615  616  617  618  619  620  621  622  623  624 
##    1    2    1    2    1    1    1    2    1    2    1    1    1    2    1    2 
##  625  626  627  628  629  630  631  632  633  634  635  636  637  638  639  640 
##    2    1    1    2    2    2    1    2    2    1    2    1    2    2    2    1 
##  641  642  643  644  645  646  647  648  649  650  651  652  653  654  655  656 
##    1    1    2    2    2    2    1    2    1    2    2    2    1    1    2    1 
##  657  658  659  660  661  662  663  664  665  666  667  668  669  670  671  672 
##    1    1    2    1    2    1    2    1    1    1    1    1    1    1    1    2 
##  673  674  675  676  677  678  679  680  681  682  683  684  685  686  687  688 
##    2    2    2    2    1    1    2    1    1    2    1    2    2    1    1    2 
##  689  690  691  692  693  694  695  696  697  698  699  700  701  702  703  704 
##    1    1    2    1    1    2    1    1    2    1    2    1    2    1    2    2 
##  705  706  707  708  709  710  711  712  713  714  715  716  717  718  719  720 
##    1    2    2    2    2    2    1    1    2    1    2    1    1    2    2    1 
##  721  722  723  724  725  726  727  728  729  730  731  732  733  734  735  736 
##    1    1    2    2    1    1    1    1    1    1    2    2    2    1    1    1 
##  737  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752 
##    1    2    2    2    1    1    1    2    1    2    1    2    2    1    2    1 
##  753  754  755  756  757  758  759  760  761  762  763  764  765  766  767  768 
##    2    2    1    1    1    1    1    1    1    1    2    2    1    1    2    1 
##  769  770  771  772  773  774  775  776  777  778  779  780  781  782  783  784 
##    2    2    2    1    2    1    2    1    2    2    1    2    2    1    2    2 
##  785  786  787  788  789  790  791  792  793  794  795  796  797  798  799  800 
##    2    1    2    1    1    1    1    2    1    2    2    2    1    1    2    1 
##  801  802  803  804  805  806  807  808  809  810  811  812  813  814  815  816 
##    1    2    1    2    2    1    1    2    1    2    2    2    1    2    1    1 
##  817  818  819  820  821  822  823  824  825  826  827  828  829  830  831  832 
##    2    1    1    2    1    1    2    1    1    2    2    2    1    1    2    2 
##  833  834  835  836  837  838  839  840  841  842  843  844  845  846  847  848 
##    1    2    2    1    1    1    2    1    1    2    1    1    2    1    2    2 
##  849  850  851  852  853  854  855  856  857  858  859  860  861  862  863  864 
##    1    2    1    2    2    2    2    1    2    1    2    1    2    2    1    1 
##  865  866  867  868  869  870  871  872  873  874  875  876  877  878  879  880 
##    2    2    2    1    1    1    2    1    1    1    2    2    1    1    1    2 
##  881  882  883  884  885  886  887  888  889  890  891  892  893  894  895  896 
##    1    2    2    1    2    2    2    1    1    2    1    1    2    2    1    1 
##  897  898  899  900  901  902  903  904  905  906  907  908  909  910  911  912 
##    1    1    2    1    1    2    2    1    2    2    2    1    1    2    1    1 
##  913  914  915  916  917  918  919  920  921  922  923  924  925  926  927  928 
##    1    2    2    1    1    1    2    1    2    1    1    1    2    1    1    2 
##  929  930  931  932  933  934  935  936  937  938  939  940  941  942  943  944 
##    2    1    1    1    2    2    2    1    1    1    1    2    1    1    2    2 
##  945  946  947  948  949  950  951  952  953  954  955  956  957  958  959  960 
##    2    2    2    2    1    2    2    2    2    2    1    1    1    1    1    1 
##  961  962  963  964  965  966  967  968  969  970  971  972  973  974  975  976 
##    1    2    1    1    1    1    1    1    1    2    1    1    1    2    1    2 
##  977  978  979  980  981  982  983  984  985  986  987  988  989  990  991  992 
##    1    2    1    2    1    2    2    2    1    2    2    2    1    2    1    1 
##  993  994  995  996  997  998  999 1000 
##    2    2    2    1    2    1    1    2 
## 
## Within cluster sum of squares by cluster:
## [1] 4105.816 3878.629
##  (between_SS / total_SS =  11.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

k-means algorithm is used to identify and assign the data to two clusters so that each observation will be assigned to one of the two clusters. From the output, we can observe that two different clusters have been found with sizes 516 and 484, and the within cluster sum of square (WCSS) =11.2% meaning the clusters are kind of compacted. But we need to visualize it to have a better look.

Cluster Plot:

# 4- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeans.result, data = cdataset)

The plot shows overlapping clusters, particularly in the middle, suggesting that some data points are challenging to assign to a specific cluster. An avegrage silhouette coefficient can be more precise so we will calculate it.

Average Silhouette Coefficient:

The value is between [-1, 1], a score of 1 denotes the best. And the worst value is -1. Values near 0 denote overlapping clusters.

#Average silhouette
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
##   cluster size ave.sil.width
## 1       1  516          0.11
## 2       2  484          0.11

The Average Silhouette Coefficient of 0.11 suggests that there is a certain level of similarity among the data points within the clusters formed through the clustering process. However, the coefficient is relatively low, approaching zero, indicating the presence of overlapping clusters.

BCubed precision and recall:

To measure the quality of the cluster the average BCubed precision and recall of all objects in the data set is computed:

# Cluster assignments and ground truth labels
cluster_assignments <- kmeans.result$cluster
ground_truth <- dataset$Risk

# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
  n <- length(cluster_assignments)
  precision_sum <- 0
  recall_sum <- 0

  for (i in 1:n) {
    cluster <- cluster_assignments[i]
    label <- ground_truth[i]

    # Count the number of items from the same category within the same cluster
    same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)

    # Count the total number of items in the same cluster
    total_same_cluster <- sum(cluster_assignments == cluster)

    # Count the total number of items with the same category
    total_same_category <- sum(ground_truth == label)

    # Calculate precision and recall for the current item and add them to the sums
    precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
    recall_sum <- recall_sum + same_category_same_cluster / total_same_category
  }
  precision <- precision_sum / n  # Calculate average precision 
  recall <- recall_sum / n        # Calculate average recall

  return(list(precision = precision, recall = recall)) }

# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)

# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall

# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
##  BCubed Precision: 0.3299589 
##  BCubed Recall: 0.5317886

The calculated precision value is 0.32996 not a high value. It means that the clusters are not pure; meaning not all data points in a cluster belong to the same category.

On the other hand, the calculated recall value of 0.53179 implies that approximately half of the objrcts belonging to the same categore are correctly assigned to the same cluster.

Conclusion of K=2:

Considering upove results for K=2 in isolation, without considering our knowledge of the class label, it is evident that the performance is suboptimal (less than ideal). Therefore, it is recommended to explore other values for K in order to achieve better clustering results.

Clustering K= 4:

scaling the data:
# 2- prepreocessing 
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)

K-means:

# 1- run k-means clustering to find 4 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)

kmeans_result <- kmeans(cdataset, centers = 4, nstart = 25)

#Accessing kmeans_result
print(kmeans_result)
## K-means clustering with 4 clusters of sizes 240, 255, 244, 261
## 
## Cluster means:
##         isMale      isBlack   isSmoker   isDiabetic isHypertensive         Age
## 1 -0.004998499  0.098461545 -1.0320124 -0.002334427     1.00954535  0.04092810
## 2 -0.101538140 -1.061382078  0.9680116  0.124685876     0.02175491 -0.01063463
## 3  0.052771040  0.005581038 -1.0320124 -0.052221191    -0.98955436  0.02269775
## 4  0.054466405  0.941225616  0.9680116 -0.070853124    -0.02447174 -0.04846424
##      Systolic Cholesterol          HDL
## 1 -0.03065348 -0.08081696 -0.004490818
## 2  0.08003760  0.02566201 -0.052055040
## 3 -0.06987709  0.12065493  0.020586343
## 4  0.01531517 -0.06355382  0.035742390
## 
## Clustering vector:
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    1    1    4    4    2    2    2    2    2    1    2    1    1    4    1    2 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##    1    4    4    1    2    3    2    4    1    1    1    1    2    3    1    1 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##    1    4    4    1    2    1    2    3    4    2    3    3    3    3    4    2 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##    3    2    4    4    3    4    2    3    4    3    3    1    4    2    2    1 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##    4    3    1    4    4    3    2    4    2    4    3    1    3    3    4    4 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##    4    4    1    1    1    3    3    3    4    1    4    4    3    2    4    2 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    1    1    2    2    1    2    1    4    4    3    1    3    3    2    2    2 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##    4    2    1    3    3    1    4    2    4    3    4    1    3    4    3    3 
##  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
##    2    4    3    1    4    4    1    3    2    2    1    2    3    1    2    4 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##    1    3    2    3    3    2    2    4    4    3    4    2    3    1    1    2 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##    1    4    2    4    4    2    1    1    2    4    1    2    3    2    2    2 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    4    1    1    3    3    1    3    3    4    1    1    1    2    2    1    3 
##  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208 
##    3    3    3    4    2    1    1    3    1    3    1    1    2    4    3    4 
##  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224 
##    4    3    2    2    2    4    4    3    1    4    3    4    1    2    4    2 
##  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 
##    2    4    3    1    4    4    3    1    1    1    3    2    4    3    2    2 
##  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256 
##    2    3    4    3    3    3    3    4    3    1    1    3    4    2    1    2 
##  257  258  259  260  261  262  263  264  265  266  267  268  269  270  271  272 
##    2    4    3    4    3    2    1    1    1    3    2    3    1    4    2    1 
##  273  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288 
##    1    3    1    4    2    4    4    4    4    1    2    3    3    2    2    3 
##  289  290  291  292  293  294  295  296  297  298  299  300  301  302  303  304 
##    3    1    4    3    4    2    4    2    3    1    1    1    2    4    1    2 
##  305  306  307  308  309  310  311  312  313  314  315  316  317  318  319  320 
##    3    1    2    2    2    2    2    2    1    4    4    2    2    4    4    1 
##  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336 
##    2    1    2    1    1    2    2    4    1    3    4    3    3    4    3    1 
##  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352 
##    2    2    3    1    3    3    3    1    4    4    3    1    3    2    2    1 
##  353  354  355  356  357  358  359  360  361  362  363  364  365  366  367  368 
##    1    1    2    2    3    1    1    1    1    1    1    3    4    4    3    4 
##  369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384 
##    1    2    3    1    2    1    3    2    2    3    4    3    1    1    3    2 
##  385  386  387  388  389  390  391  392  393  394  395  396  397  398  399  400 
##    2    1    1    3    4    4    2    3    2    2    2    4    3    2    3    4 
##  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416 
##    2    3    4    2    3    1    2    2    4    4    3    3    1    3    2    4 
##  417  418  419  420  421  422  423  424  425  426  427  428  429  430  431  432 
##    2    4    4    2    3    1    2    1    1    2    3    3    1    2    4    3 
##  433  434  435  436  437  438  439  440  441  442  443  444  445  446  447  448 
##    4    1    2    1    1    4    2    2    2    3    2    3    2    4    1    3 
##  449  450  451  452  453  454  455  456  457  458  459  460  461  462  463  464 
##    4    2    3    2    3    4    1    2    1    4    4    1    4    2    4    3 
##  465  466  467  468  469  470  471  472  473  474  475  476  477  478  479  480 
##    2    2    4    2    2    2    4    4    4    4    3    3    3    2    1    2 
##  481  482  483  484  485  486  487  488  489  490  491  492  493  494  495  496 
##    4    3    1    2    2    4    3    4    2    3    4    1    2    2    1    2 
##  497  498  499  500  501  502  503  504  505  506  507  508  509  510  511  512 
##    2    2    1    3    3    3    1    3    3    3    3    3    1    4    2    3 
##  513  514  515  516  517  518  519  520  521  522  523  524  525  526  527  528 
##    4    4    3    2    3    1    2    2    2    4    3    4    3    1    1    1 
##  529  530  531  532  533  534  535  536  537  538  539  540  541  542  543  544 
##    1    3    4    3    4    1    3    2    1    4    4    4    2    4    3    4 
##  545  546  547  548  549  550  551  552  553  554  555  556  557  558  559  560 
##    1    2    3    1    1    1    1    4    2    1    1    4    2    4    4    1 
##  561  562  563  564  565  566  567  568  569  570  571  572  573  574  575  576 
##    4    2    1    4    1    4    4    1    3    2    4    3    1    3    1    3 
##  577  578  579  580  581  582  583  584  585  586  587  588  589  590  591  592 
##    2    2    1    2    2    1    4    1    1    3    3    4    1    3    4    2 
##  593  594  595  596  597  598  599  600  601  602  603  604  605  606  607  608 
##    3    1    1    4    4    3    1    4    2    4    2    4    3    2    2    2 
##  609  610  611  612  613  614  615  616  617  618  619  620  621  622  623  624 
##    4    1    2    1    2    4    4    1    4    3    2    4    4    3    4    3 
##  625  626  627  628  629  630  631  632  633  634  635  636  637  638  639  640 
##    3    4    2    1    1    1    2    1    1    2    3    2    3    3    3    2 
##  641  642  643  644  645  646  647  648  649  650  651  652  653  654  655  656 
##    2    2    3    3    1    1    2    3    2    3    3    1    4    4    1    4 
##  657  658  659  660  661  662  663  664  665  666  667  668  669  670  671  672 
##    2    4    3    4    3    4    3    4    4    4    2    4    4    4    4    1 
##  673  674  675  676  677  678  679  680  681  682  683  684  685  686  687  688 
##    1    3    1    3    2    4    1    2    4    1    4    3    3    4    2    1 
##  689  690  691  692  693  694  695  696  697  698  699  700  701  702  703  704 
##    2    4    1    2    4    1    4    4    1    2    3    4    3    4    3    3 
##  705  706  707  708  709  710  711  712  713  714  715  716  717  718  719  720 
##    4    3    1    3    1    1    4    4    3    2    3    2    2    3    1    4 
##  721  722  723  724  725  726  727  728  729  730  731  732  733  734  735  736 
##    4    4    1    1    2    2    2    4    4    2    1    3    1    2    4    4 
##  737  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752 
##    2    1    1    3    4    2    4    3    4    3    2    1    1    4    3    2 
##  753  754  755  756  757  758  759  760  761  762  763  764  765  766  767  768 
##    3    1    4    2    2    4    2    4    4    2    1    3    2    2    3    4 
##  769  770  771  772  773  774  775  776  777  778  779  780  781  782  783  784 
##    3    3    3    2    3    4    3    4    1    3    4    1    1    4    3    3 
##  785  786  787  788  789  790  791  792  793  794  795  796  797  798  799  800 
##    3    2    3    4    2    2    4    1    4    3    3    3    2    2    3    4 
##  801  802  803  804  805  806  807  808  809  810  811  812  813  814  815  816 
##    2    1    4    3    3    2    2    3    4    3    3    1    2    1    4    2 
##  817  818  819  820  821  822  823  824  825  826  827  828  829  830  831  832 
##    1    4    2    3    4    4    1    4    4    3    1    1    2    2    3    1 
##  833  834  835  836  837  838  839  840  841  842  843  844  845  846  847  848 
##    2    3    3    2    4    2    3    4    4    1    4    4    3    4    1    1 
##  849  850  851  852  853  854  855  856  857  858  859  860  861  862  863  864 
##    2    1    4    1    3    1    3    2    3    4    3    2    3    3    2    4 
##  865  866  867  868  869  870  871  872  873  874  875  876  877  878  879  880 
##    3    3    1    4    2    2    1    4    2    4    3    3    4    2    4    3 
##  881  882  883  884  885  886  887  888  889  890  891  892  893  894  895  896 
##    4    1    1    2    3    1    1    2    4    3    4    4    1    1    4    2 
##  897  898  899  900  901  902  903  904  905  906  907  908  909  910  911  912 
##    2    4    3    2    2    1    1    4    3    3    1    4    2    3    4    4 
##  913  914  915  916  917  918  919  920  921  922  923  924  925  926  927  928 
##    2    3    1    2    4    2    3    2    1    2    2    2    1    4    2    3 
##  929  930  931  932  933  934  935  936  937  938  939  940  941  942  943  944 
##    1    2    4    2    3    1    3    4    4    2    2    1    4    2    3    1 
##  945  946  947  948  949  950  951  952  953  954  955  956  957  958  959  960 
##    1    3    1    1    4    1    1    1    1    1    4    2    4    4    4    2 
##  961  962  963  964  965  966  967  968  969  970  971  972  973  974  975  976 
##    4    3    2    4    2    2    4    4    2    3    4    4    4    1    2    1 
##  977  978  979  980  981  982  983  984  985  986  987  988  989  990  991  992 
##    4    1    4    3    2    3    1    3    2    3    1    3    4    1    4    2 
##  993  994  995  996  997  998  999 1000 
##    1    1    3    2    3    4    4    3 
## 
## Within cluster sum of squares by cluster:
## [1] 1648.259 1799.286 1739.876 1778.161
##  (between_SS / total_SS =  22.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We can observe that four different clusters have been found with sizes 240 , 255 ,244 and 261. And the within cluster sum of square (WCSS) =22.5% which means that the cluster less compact and cohesive. Its higher than 2 clusters result which means 2 clusters are better in terms of compactness.

Cluster plot :

# 2- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeans_result, data = cdataset)

As we can see In the cluster plot, it’s evident that there are overlapping clusters.

Average Silhouette Coefficient:

#3-Average silhouette
library(cluster)
avg_sil <- silhouette(kmeans_result$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
##   cluster size ave.sil.width
## 1       1  240          0.13
## 2       2  255          0.12
## 3       3  244          0.12
## 4       4  261          0.13

An Average Silhouette coefficient of 0.12 indicate that the clustering is not very well-defined, and there is ambiguity and overlap between clusters. However, the result is higher than 2 clusters.

BCubed precision and recall:

# Cluster assignments and ground truth labels
cluster_assignments <- kmeans_result$cluster
ground_truth <- dataset$Risk

# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
  n <- length(cluster_assignments)
  precision_sum <- 0
  recall_sum <- 0

  for (i in 1:n) {
    cluster <- cluster_assignments[i]
    label <- ground_truth[i]

    # Count the number of items from the same category within the same cluster
    same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)

    # Count the total number of items in the same cluster
    total_same_cluster <- sum(cluster_assignments == cluster)

    # Count the total number of items with the same category
    total_same_category <- sum(ground_truth == label)

    # Calculate precision and recall for the current item and add them to the sums
    precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
    recall_sum <- recall_sum + same_category_same_cluster / total_same_category
  }
  precision <- precision_sum / n  # Calculate average precision 
  recall <- recall_sum / n        # Calculate average recall

  return(list(precision = precision, recall = recall)) }

# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)

# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall

# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
##  BCubed Precision: 0.336335 
##  BCubed Recall: 0.2729542

The calculated precision value is 0.336335 not a high value it mean the clusters are not pure.and not all data points in a cluster belong to the same category.

The calculated recall value is 0.2729542 it’s a low result meaning most of the data are not in the same cluster.

Conclusion of K=4:

After applying various evaluation metrics such as the average silhouette coefficient, within-cluster sum of squares ,Bcubed precision and recall.it became clear to us that k=4 Is not a good number of clusters since there is overlapping and the clusters are not pure .And the within cluster sum of square 4 clusters has a higher value than 2 cluster indicating that the 4 clusters less compact .but According to the number of class label its the best among the considered options.

Clustering K=8 :

scaling the data:
# 2- prepreocessing 
#Data types should be transformed into numeric types before clustering.
cdataset <- scale(cdataset)

K-means:

# 3- run k-means clustering to find 8 clusters
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeansresult <- kmeans(cdataset,8)
# print the clusterng result
kmeansresult
## K-means clustering with 8 clusters of sizes 136, 149, 100, 132, 122, 93, 139, 129
## 
## Cluster means:
##       isMale    isBlack    isSmoker  isDiabetic isHypertensive         Age
## 1  0.6374557  0.9412256  0.96801163 -0.11758451      0.4803719 -0.42674815
## 2  0.9928563  0.1348064 -1.03201240 -0.06416429      0.3789569 -0.26292001
## 3  0.8197539 -0.2403129 -0.09200111 -0.06403000     -0.9895544  0.72578390
## 4 -0.9645589  0.5467726 -0.78958524  0.24399312      0.4189023  0.34795779
## 5 -0.6683239 -0.4868635  0.60735156 -0.12602627     -0.9895544 -0.76525097
## 6 -0.4852307  0.3598234  0.86048345  0.31098443     -0.7316060  0.75221709
## 7 -0.3324182 -0.2401689 -1.03201240 -0.23835629     -0.1410156 -0.05978714
## 8 -0.1272486 -1.0613821  0.96801163  0.14986867      1.0095454  0.08076629
##      Systolic Cholesterol         HDL
## 1  0.06575189 -0.11568304 -0.01941434
## 2 -0.44527267 -0.11897944 -0.04664307
## 3 -0.13781479  0.97280405  0.09080812
## 4 -0.67231955  0.46294127  0.02679507
## 5 -0.42367631 -0.08839685 -0.13312255
## 6  0.50518690 -0.83938009  0.32754430
## 7  1.08076911 -0.37828447 -0.03913655
## 8  0.11170739  0.12791071 -0.09153707
## 
## Clustering vector:
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    2    7    1    1    8    8    5    3    8    7    3    7    2    1    2    6 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##    2    1    1    7    5    3    8    1    2    2    4    2    8    3    7    4 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##    2    1    6    2    6    2    3    3    6    5    3    7    4    3    5    5 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##    2    6    1    1    4    5    8    7    6    5    3    4    5    8    8    2 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##    1    4    2    1    6    2    5    1    8    6    4    4    3    2    1    6 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##    6    6    4    4    2    2    7    2    5    7    5    6    4    8    1    8 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    4    2    5    8    7    3    4    1    5    7    7    5    4    5    5    6 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##    6    3    4    7    5    2    6    8    3    7    6    2    2    1    7    2 
##  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
##    8    4    4    7    3    1    4    4    6    8    2    5    7    2    6    1 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##    2    3    8    3    3    5    8    1    1    2    1    8    6    4    7    5 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##    2    1    8    1    1    8    4    2    5    6    4    5    7    8    8    5 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    1    2    4    2    3    2    3    7    5    2    4    7    8    8    4    7 
##  193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208 
##    2    3    7    3    8    2    7    2    7    2    4    2    8    1    4    1 
##  209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224 
##    6    7    3    8    8    1    6    2    7    1    7    1    4    6    1    8 
##  225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240 
##    8    3    7    2    1    3    7    4    7    2    7    8    6    7    8    8 
##  241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256 
##    8    4    1    4    7    4    2    1    3    2    2    7    5    8    4    8 
##  257  258  259  260  261  262  263  264  265  266  267  268  269  270  271  272 
##    5    1    4    3    5    6    4    4    7    5    8    3    7    1    8    7 
##  273  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288 
##    4    7    2    5    8    6    1    5    6    2    8    3    5    5    5    4 
##  289  290  291  292  293  294  295  296  297  298  299  300  301  302  303  304 
##    2    2    6    7    1    8    1    8    7    4    4    7    6    1    2    5 
##  305  306  307  308  309  310  311  312  313  314  315  316  317  318  319  320 
##    7    7    8    5    8    3    5    8    7    6    1    3    8    1    1    4 
##  321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336 
##    5    2    6    7    2    8    5    4    2    3    6    7    3    5    7    2 
##  337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352 
##    8    8    4    4    2    2    7    4    1    1    7    4    3    8    3    4 
##  353  354  355  356  357  358  359  360  361  362  363  364  365  366  367  368 
##    2    7    5    8    4    7    4    4    2    2    2    3    1    1    2    4 
##  369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384 
##    4    3    7    4    8    7    2    5    8    7    1    7    2    2    3    6 
##  385  386  387  388  389  390  391  392  393  394  395  396  397  398  399  400 
##    6    2    4    3    1    1    8    4    8    5    8    1    7    8    2    4 
##  401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416 
##    8    7    1    8    7    2    3    8    1    6    7    3    7    5    8    1 
##  417  418  419  420  421  422  423  424  425  426  427  428  429  430  431  432 
##    5    1    4    8    2    4    8    4    2    8    7    4    4    5    3    7 
##  433  434  435  436  437  438  439  440  441  442  443  444  445  446  447  448 
##    1    2    8    4    7    5    8    5    8    2    8    3    6    6    4    3 
##  449  450  451  452  453  454  455  456  457  458  459  460  461  462  463  464 
##    3    8    2    8    7    1    4    3    4    1    1    2    6    6    1    2 
##  465  466  467  468  469  470  471  472  473  474  475  476  477  478  479  480 
##    8    8    6    8    6    5    3    6    1    6    7    5    7    5    2    3 
##  481  482  483  484  485  486  487  488  489  490  491  492  493  494  495  496 
##    5    4    7    8    5    1    6    6    8    2    1    2    8    5    2    8 
##  497  498  499  500  501  502  503  504  505  506  507  508  509  510  511  512 
##    8    5    2    4    4    2    2    5    5    2    4    5    7    1    8    5 
##  513  514  515  516  517  518  519  520  521  522  523  524  525  526  527  528 
##    6    6    6    8    3    7    8    6    6    4    5    4    2    7    4    2 
##  529  530  531  532  533  534  535  536  537  538  539  540  541  542  543  544 
##    7    3    1    5    1    7    2    3    4    6    1    1    3    3    7    1 
##  545  546  547  548  549  550  551  552  553  554  555  556  557  558  559  560 
##    2    3    2    7    2    4    4    6    3    7    7    1    8    5    4    7 
##  561  562  563  564  565  566  567  568  569  570  571  572  573  574  575  576 
##    6    5    4    1    2    1    6    2    6    3    6    7    4    4    2    3 
##  577  578  579  580  581  582  583  584  585  586  587  588  589  590  591  592 
##    3    8    7    8    8    4    6    4    2    2    2    4    4    7    6    8 
##  593  594  595  596  597  598  599  600  601  602  603  604  605  606  607  608 
##    4    7    2    4    1    7    4    5    5    3    8    3    3    8    6    8 
##  609  610  611  612  613  614  615  616  617  618  619  620  621  622  623  624 
##    1    2    5    4    3    1    6    2    1    3    8    6    1    3    4    7 
##  625  626  627  628  629  630  631  632  633  634  635  636  637  638  639  640 
##    2    5    6    2    4    7    8    2    7    5    4    6    3    7    3    5 
##  641  642  643  644  645  646  647  648  649  650  651  652  653  654  655  656 
##    8    5    7    7    7    2    5    7    5    2    5    2    6    1    7    1 
##  657  658  659  660  661  662  663  664  665  666  667  668  669  670  671  672 
##    8    1    3    4    3    5    7    1    1    1    5    1    3    1    1    4 
##  673  674  675  676  677  678  679  680  681  682  683  684  685  686  687  688 
##    4    3    4    3    8    5    2    5    1    4    1    7    2    5    6    2 
##  689  690  691  692  693  694  695  696  697  698  699  700  701  702  703  704 
##    3    6    7    6    5    2    6    1    4    8    7    6    7    1    3    7 
##  705  706  707  708  709  710  711  712  713  714  715  716  717  718  719  720 
##    6    7    4    4    2    7    1    1    7    5    7    6    8    5    7    6 
##  721  722  723  724  725  726  727  728  729  730  731  732  733  734  735  736 
##    5    1    4    7    8    8    8    1    1    6    2    2    4    5    1    5 
##  737  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752 
##    8    4    2    4    6    5    1    2    1    2    5    4    2    5    3    8 
##  753  754  755  756  757  758  759  760  761  762  763  764  765  766  767  768 
##    4    2    5    3    8    6    3    6    6    5    2    5    5    8    3    1 
##  769  770  771  772  773  774  775  776  777  778  779  780  781  782  783  784 
##    3    7    3    8    7    6    2    1    4    7    5    2    4    1    3    2 
##  785  786  787  788  789  790  791  792  793  794  795  796  797  798  799  800 
##    3    5    4    5    8    8    3    2    5    4    4    5    5    3    2    1 
##  801  802  803  804  805  806  807  808  809  810  811  812  813  814  815  816 
##    3    2    1    5    7    8    8    2    5    7    2    4    6    2    6    3 
##  817  818  819  820  821  822  823  824  825  826  827  828  829  830  831  832 
##    7    1    3    3    1    4    4    1    1    4    4    7    5    5    5    4 
##  833  834  835  836  837  838  839  840  841  842  843  844  845  846  847  848 
##    5    4    7    8    1    5    5    5    5    2    5    1    3    4    7    2 
##  849  850  851  852  853  854  855  856  857  858  859  860  861  862  863  864 
##    8    2    6    2    4    2    7    3    7    1    5    8    7    7    3    4 
##  865  866  867  868  869  870  871  872  873  874  875  876  877  878  879  880 
##    3    3    4    6    8    5    2    6    5    1    2    7    6    8    1    7 
##  881  882  883  884  885  886  887  888  889  890  891  892  893  894  895  896 
##    1    7    4    8    7    7    2    5    1    7    1    1    2    7    1    5 
##  897  898  899  900  901  902  903  904  905  906  907  908  909  910  911  912 
##    6    4    7    5    8    2    4    1    2    7    2    1    8    3    1    1 
##  913  914  915  916  917  918  919  920  921  922  923  924  925  926  927  928 
##    8    3    2    8    1    3    7    8    7    8    5    5    4    1    8    3 
##  929  930  931  932  933  934  935  936  937  938  939  940  941  942  943  944 
##    2    3    5    8    4    2    4    6    3    8    6    4    1    5    7    2 
##  945  946  947  948  949  950  951  952  953  954  955  956  957  958  959  960 
##    7    7    4    7    1    4    4    7    2    4    1    5    1    1    1    8 
##  961  962  963  964  965  966  967  968  969  970  971  972  973  974  975  976 
##    6    2    3    1    5    8    1    1    5    3    6    1    1    2    5    7 
##  977  978  979  980  981  982  983  984  985  986  987  988  989  990  991  992 
##    6    7    1    3    8    7    2    4    8    7    2    2    6    2    5    8 
##  993  994  995  996  997  998  999 1000 
##    2    2    7    8    6    1    6    7 
## 
## Within cluster sum of squares by cluster:
## [1] 797.7227 949.7918 641.1214 793.4957 737.4096 520.4053 918.1900 761.7654
##  (between_SS / total_SS =  31.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We can observe that the eight different clusters have been found with sizes 136, 149, 100, 132,93, 139 and 129 respectively, and the within cluster sum of square (WCSS) = 31.9%. which is higher than 2 and 4 clusters result which means 2,4 clusters are better in terms of compactness or homogeneity compared to the clustering result of 8 clusters.

Cluster Plot:

# 2- visualize clustering and install package
library(factoextra)
fviz_cluster(kmeansresult, data = cdataset)

It’s clear that the eight clusters are overlapping.

Average Silhouette Coefficient:

#Average silhouette
library(cluster)
avg_sil <- silhouette(kmeansresult$cluster, dist(cdataset))
# k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
##   cluster size ave.sil.width
## 1       1  136          0.12
## 2       2  149          0.09
## 3       3  100          0.08
## 4       4  132          0.12
## 5       5  122          0.10
## 6       6   93          0.13
## 7       7  139          0.06
## 8       8  129          0.12

An Average Silhouette Coefficient of 0.1 indicates that, the clusters formed in the clustering process have some degree of similarity among their data points. However, the result is lower than 2 clusters which has silhouette coefficient average of 0.11 and also lower than K=4 clusters that is equal to 0.12.

BCubed precision and recall:

# Cluster assignments and ground truth labels
cluster_assignments <- kmeansresult$cluster
ground_truth <- dataset$Risk

# Function to calculate BCubed precision and recall
calculate_bcubed_metrics <- function(cluster_assignments, ground_truth) {
  n <- length(cluster_assignments)
  precision_sum <- 0
  recall_sum <- 0

  for (i in 1:n) {
    cluster <- cluster_assignments[i]
    label <- ground_truth[i]

    # Count the number of items from the same category within the same cluster
    same_category_same_cluster <- sum(ground_truth[cluster_assignments == cluster] == label)

    # Count the total number of items in the same cluster
    total_same_cluster <- sum(cluster_assignments == cluster)

    # Count the total number of items with the same category
    total_same_category <- sum(ground_truth == label)

    # Calculate precision and recall for the current item and add them to the sums
    precision_sum <- precision_sum + same_category_same_cluster / total_same_cluster
    recall_sum <- recall_sum + same_category_same_cluster / total_same_category
  }
  precision <- precision_sum / n  # Calculate average precision 
  recall <- recall_sum / n        # Calculate average recall

  return(list(precision = precision, recall = recall)) }

# Calculate BCubed precision and recall
precision_recall <- calculate_bcubed_metrics(cluster_assignments, ground_truth)

# Extract precision and recall from the metrics
precision <- precision_recall$precision
recall <- precision_recall$recall

# Print the results
cat(" BCubed Precision:", precision, "\n","BCubed Recall:", recall)
##  BCubed Precision: 0.3747497 
##  BCubed Recall: 0.1554135

The calculated precision value is 0.37478 not a high value it mean the clusters are not pure.

The calculated recall value is 0.15541 it’s a low result meaning most of the data are not in the same cluster.

Conclusion of K=8:

Is not a good number of clusters especially when compared to the results obtained with K=2 and K=4 clusters. This conclusion is based on various evaluation metrics such as the average silhouette coefficient, within-cluster sum of squares, and Bcubed precision and recall. In all aspects, K=8 performed the worst. Additionally, considering the presence of class labels and our prior knowledge of the data set, we know the actual number of groups within the class label. So, by also taking this information into account, we can determine that K=8 is not an optimal number of clusters.

Validation:

library(NbClust)
#a)fviz_nbclust() with silhouette method using library(factoextra) 
fviz_nbclust(cdataset, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

#b) NbClust validation
fres.nbclust <- NbClust(cdataset, distance="euclidean", min.nc = 2, max.nc = 10, method="kmeans", index="all")
## Warning in log(det(P)/det(W)): NaNs produced

## Warning in log(det(P)/det(W)): NaNs produced

## Warning in log(det(P)/det(W)): NaNs produced

## Warning in log(det(P)/det(W)): NaNs produced

## Warning in log(det(P)/det(W)): NaNs produced

## Warning in log(det(P)/det(W)): NaNs produced

## Warning in log(det(P)/det(W)): NaNs produced
## Warning: did not converge in 10 iterations

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 8 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4 
##  
##  
## *******************************************************************

According to the NbClust validation method, which utilizes the majority rule, the best number of clusters is 4. This number contradicts the initial suggestion from the silhouette method, which indicated that the best number of clusters is 2. However, upon revisiting the calculations and evaluating the performance, it is almost accurate to conclude that K=4 indeed performs the best among the considered options.


6- Evaluation and Comparison

Classification Metrics Analysis

80 %t raining set 20% testing set: 70% raining set 30% testing set: 60% raining set 40% testing set:
IG IG ratio Gini Index IG IG ratio Gini Index IG IG ratio Gini Index
Accuracy 82.27% 81.01% 54.75% 78.07% 79.39% 60.31% 78.37% 78.38% 57.39%
Sensitivity 80.28% 78.26% 51.02% 77.23% 78.00% 55.10% 80.60% 73.72% 50.81%
Specificity 82.85% 81.78% 56.42% 78.31% 79.78% 62.78% 77.60% 79.92% 60.14%
Precision 75.00% 71.05% 54.35% 72.90% 72.90% 64.29% 72.26% 74.19% 53.41%

The Information Gain (IG) model, when trained with an 80% training set, stands out due to its exceptional accuracy (82.27%) and sensitivity (80.28%). This suggests that the model is adept at correctly identifying patients with a risk of ASCVD, a critical factor in preventative health measures. The high specificity (82.85%) further underscores the model’s ability to discern true negatives, minimizing false alarms and unnecessary treatments.

Overall Comparison


In each partition based on the metrics results, the decision tree algorithm demonstrated varying degrees of performance. The 80-20 split emerged as the most favorable for all metrics, showcasing the highest accuracy, balanced sensitivity and specificity, and commendable precision. This suggests that the model trained on 80% of the data and tested on the remaining 20% achieved the most reliable predictions for the 10-year ASCVD risk.

Comparing the three splits, the 80-20 configuration consistently outperformed the others, making it the preferred choice among the algorithms. It demonstrated superior accuracy, successfully navigating the intricacies of the dataset and maintaining a balance between correctly identifying positive and negative instances. While the 70-30 and 60-40 splits exhibited respectable performances, they fell short of the comprehensive reliability achieved by the 80-20 split.

In summary, the 80-20 split with the decision tree algorithm emerged as the optimal configuration, providing the most accurate and balanced predictions for the 10-year ASCVD risk. This analysis underscores the importance of careful consideration in choosing the training and testing split, with the 80-20 partition demonstrating its efficacy as the best-performing algorithm among all evaluated configurations.

Clustering Metrics Analysis

We tested 3 different number of clusters : K=2, K=4, K=8

K= 2 K= 4 K= 8
Average Silhouette width 0.11 0.12 0.1
Total within-cluster sum of square 11.2% 22.5% 31.9%
BCubed precision 0.3299589 0.336335 0.3747497
BCubed recall 0.5317886 0.2729542 0.1554135
Visualization

Overall Comparison

In an overall comparison of the clustering results based on the metrics provided (Average Silhouette Width, Total Within-Cluster Sum of Square, BCubed Precision, and BCubed Recall) for different numbers of clusters (K=2, K=4, K=8)

we can say that K=4 has the highest average silhouette width, indicating well-defined and distinct clusters. The total within-cluster sum of square is higher, suggesting that the clusters are less compact. The precision is slightly better than K=2, but the recall is lower .based on the result of average silhouette width It suggests that partitioning the data into four clusters is favorable.

K=2 The average silhouette width is moderate, indicating some separation between clusters. it has the lowest total within-cluster sum of square percentage, suggests that the two clusters are relatively compact. The precision is decent, suggesting that the instances within each cluster are somewhat similar. However, the high recall indicates that there might be some instances that are not well captured by the clusters.

K=8 has the lowest average silhouette width, indicating less separation between clusters. The total within-cluster sum of square is the highest, suggesting that the clusters are less compact. The precision is the highest, but the recall is the lowest among the three. This may imply that while the instances within each cluster are similar, many relevant instances are missed.

K=4 a good choice as it has a higher average silhouette width compared to K=2.

K=8, based on these metrics, appears to have less favorable results compared to K=2 and K=4.

Comparison: Classification vs. Clustering

In this study, classification algorithms consistently outperform clustering algorithms in accurately predicting outcomes based on the provided features. While clustering may reveal inherent patterns and groupings within the data, it might not be as effective in predicting specific classes as witnessed in classification. Therefore, for this dataset and problem, classification appears to be the more suitable approach.

7- Findings

In the beginning, we selected a dataset that represents a 1000 generated samples with different kinds of health condition to predict the probability of having a 10-year ASCVD risk.

To ensure the highest level of efficiency and the most accurate results, we implemented a series of preprocessing steps. By using clear visual representations such as boxplots and histograms, we were able to get a clear picture of our data’s characteristics. This allowed us to effectively identify and remove any irregularities, such as missing information or statistical outliers, which could potentially distort our results. We then applied normalization and data balancing techniques, which adjusted the scales of our data features to a uniform range, and discretized the continuous ‘Risk’ variable into distinct categories, thereby simplifying the interpretation of risk levels for our classification tasks.

With our data prepared, we embarked on the core tasks of classification and clustering. Our tool of choice for the former was the decision tree model, tested across 3 different splits of training and testing data to get the best model’s accuracy. Our techniques yielded the following results:

for the classification, the 80-20 split with the decision tree algorithm provided the most accurate in all models especially in the The Information Gain (IG) model stands out due to its exceptional accuracy (82.27%), Sensitivity, Specificity and Precision. the key findings for the tree are :

  • Age as a Significant Predictor: Across all trees, age consistently appears as a significant factor and serving as the root node. This underlines the model’s reliance on age as a primary risk indicator,

  • Systolic Blood Pressure and HDL Cholesterol: These two health metrics are frequently used as secondary splits following age, indicating their importance in cardiovascular risk assessment. Higher systolic blood pressure is generally associated with higher risk, while higher HDL cholesterol levels often indicate lower risk.

  • Diabetes and smoking status further refine risk predictions, with diabetic individuals generally at a higher risk.

  • The presence of hypertension, especially in combination with other risk factors like high cholesterol, elevates the risk level.

As for clustering, we utilized the K-means clustering algorithm with different values of K to determine the optimal number of clusters. and evaluated the performance of each K value by analyzing various metrics, including the average silhouette width. Here are the key findings:

  • Among the tested K values, K=4 yielded the most favorable result.

  • The average silhouette width for K=4 was calculated to be 0.12, indicating better separation between clusters compared to other K values.

  • Following the majority rule, the optimal number of clusters for the dataset was determined to be 4.

  • Analyzing the scree plot, a notable observation is that the total within-cluster sum of squares (WCSS) decreases as the number of clusters increases. The selection of the optimal number of clusters is determined by identifying an “elbow” point.

    # Decide how many clusters to look at
    n_clusters <- 10
    
    # Initialize total within sum of squares error: wss
    wss <- numeric(n_clusters)
    
    set.seed(123)
    
    # Look over 1 to n possible clusters
    for (i in 1:10) {
      # Fit the model: km.out
      km.out <- kmeans(cdataset, centers = i, nstart = 20)
      # Save the within cluster sum of squares
      wss[i] <- km.out$tot.withinss
    }
    
    # Produce a scree plot
    wss_df <- tibble(clusters = 1:10, wss = wss)
    
    scree_plot <- ggplot(wss_df, aes(x = clusters, y = wss, group = 1)) +
        geom_point(size = 4)+
        geom_line() +
        scale_x_continuous(breaks = c(2, 4, 6, 8, 10)) +
        xlab('Number of clusters')
    scree_plot

    scree_plot +
        geom_hline(
            yintercept = wss, 
            linetype = 'dashed', 
            col = c(rep('#000000',3),'#FF0000', rep('#000000', 6))
        )

    The identified elbow point corresponds to K=4, indicating that the WCSS decreases at a slower rate beyond this number of clusters. Thus, K=4 is considered the suitable number of clusters based on this criterion.

All these mentioned findings highlight the effectiveness of utilizing the K-means algorithm with K=4 in achieving the highest level of separation among the clusters under consideration.

In summary, both the supervised learning model (Classification) and the unsupervised learning model (Clustering) played crucial roles in predicting the 10-year ASCVD risk in adults using key features, contributing to the successful accomplishment of our goal.

The supervised learning model (Classification), benefiting from the inclusion of the class label “Risk” in the dataset, proved to be more accurate, precise, and suitable for the task.

On the other hand, the unsupervised learning model (Clustering) encountered challenges in achieving pure clusters due to the absence of labeled data. Despite this limitation, it still provided valuable insights into the underlying patterns and structures within the dataset.

8- References

[1] “Data Preprocessing in R,” Engineering Education (EngEd) Program | Section. https://www.section.io/engineering-education/data-preprocessing-in-r/

[2] “K-Means Clustering in R with Step by Step Code Examples,” www.datacamp.com. https://www.datacamp.com/tutorial/k-means-clustering-r

[3] M. Sarah, “A Comprehensive Guide to Cluster Analysis: Applications, Best Practices and Resources,” Displayr, Jun. 06, 2023. https://www.displayr.com/understanding-cluster-analysis-a-comprehensive-guide/

[4] “RPubs - Data Mining: Classification with Decision Trees,” rpubs.com. https://rpubs.com/kjmazidi/195428

[5 ] “RPubs - Classification and Regression Trees (CART) in R,” rpubs.com. https://rpubs.com/camguild/803096